Warning: Permanently added '54.162.182.102' (ED25519) to the list of known hosts.

You can reproduce this build on your computer by running:

  sudo dnf install copr-rpmbuild
  /usr/bin/copr-rpmbuild --verbose --drop-resultdir --task-url https://copr.fedorainfracloud.org/backend/get-build-task/7325889-epel-8-aarch64 --chroot epel-8-aarch64

Version: 0.72
PID: 6853
Logging PID: 6854
Task:
{'allow_user_ssh': False,
 'appstream': False,
 'background': False,
 'build_id': 7325889,
 'buildroot_pkgs': [],
 'chroot': 'epel-8-aarch64',
 'enable_net': True,
 'fedora_review': False,
 'git_hash': 'fe22476c3c0b61ebd0a9858693b287e4007599c0',
 'git_repo': 'https://copr-dist-git.fedorainfracloud.org/git/rezso/ML/cutlass',
 'isolation': 'default',
 'memory_reqs': 2048,
 'package_name': 'cutlass',
 'package_version': '3.5.0-20240411.1.cu12_4',
 'project_dirname': 'ML',
 'project_name': 'ML',
 'project_owner': 'rezso',
 'repo_priority': None,
 'repos': [{'baseurl': 'https://download.copr.fedorainfracloud.org/results/rezso/ML/epel-8-aarch64/',
            'id': 'copr_base',
            'name': 'Copr repository',
            'priority': None},
           {'baseurl': 'https://download.copr.fedorainfracloud.org/results/rezso/CUDA/epel-8-aarch64/',
            'id': 'copr_rezso_CUDA',
            'name': 'Additional repo copr_rezso_CUDA'},
           {'baseurl': 'http://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64',
            'id': 'http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64',
            'name': 'Additional repo http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64'},
           {'baseurl': 'http://developer.download.nvidia.com/compute/cuda/repos/rhel8/sbsa',
            'id': 'http_developer_download_nvidia_com_compute_cuda_repos_rhel8_sbsa',
            'name': 'Additional repo http_developer_download_nvidia_com_compute_cuda_repos_rhel8_sbsa'},
           {'baseurl': 'http://developer.download.nvidia.com/compute/cuda/repos/rhel8/ppc64le',
            'id': 'http_developer_download_nvidia_com_compute_cuda_repos_rhel8_ppc64le',
            'name': 'Additional repo http_developer_download_nvidia_com_compute_cuda_repos_rhel8_ppc64le'}],
 'sandbox': 'rezso/ML--rezso',
 'source_json': {},
 'source_type': None,
 'ssh_public_keys': None,
 'submitter': 'rezso',
 'tags': [],
 'task_id': '7325889-epel-8-aarch64',
 'timeout': 172800,
 'uses_devel_repo': False,
 'with_opts': [],
 'without_opts': []}

Running: git clone https://copr-dist-git.fedorainfracloud.org/git/rezso/ML/cutlass /var/lib/copr-rpmbuild/workspace/workdir-zcskrd06/cutlass --depth 500 --no-single-branch --recursive

cmd: ['git', 'clone', 'https://copr-dist-git.fedorainfracloud.org/git/rezso/ML/cutlass', '/var/lib/copr-rpmbuild/workspace/workdir-zcskrd06/cutlass', '--depth', '500', '--no-single-branch', '--recursive']
cwd: .
rc: 0
stdout:
stderr: Cloning into '/var/lib/copr-rpmbuild/workspace/workdir-zcskrd06/cutlass'...

Running: git checkout fe22476c3c0b61ebd0a9858693b287e4007599c0 --

cmd: ['git', 'checkout', 'fe22476c3c0b61ebd0a9858693b287e4007599c0', '--']
cwd: /var/lib/copr-rpmbuild/workspace/workdir-zcskrd06/cutlass
rc: 0
stdout:
stderr: Note: switching to 'fe22476c3c0b61ebd0a9858693b287e4007599c0'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at fe22476 automatic import of cutlass

Running: copr-distgit-client sources

cmd: ['copr-distgit-client', 'sources']
cwd: /var/lib/copr-rpmbuild/workspace/workdir-zcskrd06/cutlass
rc: 0
stdout:
stderr: INFO: Reading stdout from command: git rev-parse --abbrev-ref HEAD
INFO: Reading stdout from command: git rev-parse HEAD
INFO: Reading sources specification file: sources

/usr/bin/tail: /var/lib/copr-rpmbuild/main.log: file truncated

Running (timeout=172800): unbuffer mock --spec /var/lib/copr-rpmbuild/workspace/workdir-zcskrd06/cutlass/cutlass.spec --sources /var/lib/copr-rpmbuild/workspace/workdir-zcskrd06/cutlass --resultdir /var/lib/copr-rpmbuild/results --uniqueext 1713469169.948153 -r /var/lib/copr-rpmbuild/results/configs/child.cfg
INFO: mock.py version 5.5 starting (python version = 3.12.1, NVR = mock-5.5-1.fc39), args: /usr/libexec/mock/mock --spec /var/lib/copr-rpmbuild/workspace/workdir-zcskrd06/cutlass/cutlass.spec --sources /var/lib/copr-rpmbuild/workspace/workdir-zcskrd06/cutlass --resultdir /var/lib/copr-rpmbuild/results --uniqueext 1713469169.948153 -r /var/lib/copr-rpmbuild/results/configs/child.cfg
Start(bootstrap): init plugins
INFO: tmpfs initialized
INFO: selinux enabled
INFO: chroot_scan: initialized
INFO: compress_logs: initialized
Finish(bootstrap): init plugins
Start: init plugins
INFO: tmpfs initialized
INFO: selinux enabled
INFO: chroot_scan: initialized
INFO: compress_logs: initialized
Finish: init plugins
INFO: Signal handler active
Start: run
INFO: Start(/var/lib/copr-rpmbuild/workspace/workdir-zcskrd06/cutlass/cutlass.spec)  Config(rhel+epel-8-aarch64)
Start: clean chroot
Finish: clean chroot
Mock Version: 5.5
INFO: Mock Version: 5.5
Start(bootstrap): chroot init
INFO: mounting tmpfs at /var/lib/mock/rhel+epel-8-aarch64-bootstrap-1713469169.948153/root.
INFO: calling preinit hooks
INFO: enabled root cache
INFO: enabled package manager cache
Start(bootstrap): cleaning package manager metadata
Finish(bootstrap): cleaning package manager metadata
INFO: Guessed host environment type: unknown
INFO: Using bootstrap image: registry.access.redhat.com/ubi8/ubi
INFO: Pulling image: registry.access.redhat.com/ubi8/ubi
INFO: Copy content of container registry.access.redhat.com/ubi8/ubi to /var/lib/mock/rhel+epel-8-aarch64-bootstrap-1713469169.948153/root
INFO: Checking that registry.access.redhat.com/ubi8/ubi image matches host's architecture
INFO: mounting registry.access.redhat.com/ubi8/ubi with podman image mount
INFO: image registry.access.redhat.com/ubi8/ubi as /var/lib/containers/storage/overlay/7a2b39246d28ad6bab256f11e346ac22a58bb61a7e4b4293c3875c8fcb1ec2fe/merged
INFO: umounting image registry.access.redhat.com/ubi8/ubi (/var/lib/containers/storage/overlay/7a2b39246d28ad6bab256f11e346ac22a58bb61a7e4b4293c3875c8fcb1ec2fe/merged) with podman image umount
INFO: Package manager dnf detected and used (fallback)
INFO: Not updating bootstrap chroot, bootstrap_image_ready=True
Start(bootstrap): creating root cache
Finish(bootstrap): creating root cache
Finish(bootstrap): chroot init
Start: chroot init
INFO: mounting tmpfs at /var/lib/mock/rhel+epel-8-aarch64-1713469169.948153/root.
INFO: calling preinit hooks
INFO: enabled root cache
INFO: enabled package manager cache
Start: cleaning package manager metadata
Finish: cleaning package manager metadata
INFO: enabled HW Info plugin
INFO: Package manager dnf detected and used (direct choice)
INFO: Buildroot is handled by package management downloaded with a bootstrap image:
  rpm-4.14.3-28.el8_9.aarch64
  python3-dnf-4.7.0-19.el8.noarch
  python3-dnf-plugins-core-4.0.21-23.el8.noarch
  yum-4.7.0-19.el8.noarch
Start: installing minimal buildroot with dnf
No matches found for the following disable plugin patterns: local, spacewalk, versionlock
Updating Subscription Management repositories.
Unable to read consumer identity

This system is not registered with an entitlement server. You can use subscription-manager to register.

Copr repository                                 2.4 MB/s | 939 kB     00:00
Additional repo copr_rezso_CUDA                 234 kB/s |  60 kB     00:00
Additional repo http_developer_download_nvidia_ 120 MB/s | 3.3 MB     00:00
Additional repo http_developer_download_nvidia_  91 MB/s | 2.0 MB     00:00
Additional repo http_developer_download_nvidia_  81 MB/s | 1.8 MB     00:00
Red Hat Enterprise Linux - BaseOS                77 MB/s |  53 MB     00:00
Red Hat Enterprise Linux - AppStream             93 MB/s |  52 MB     00:00
Red Hat Enterprise Linux - CodeReady Linux Buil  11 MB/s | 5.6 MB     00:00
Extra Packages for Enterprise Linux 8 - aarch64  78 MB/s |  16 MB     00:00
Modular dependency problems:

 Problem 1: nothing provides requested module(nvidia-driver:latest-dkms:20240416083839)
 Problem 2: nothing provides requested module(nvidia-driver:latest-dkms:20240416084208)
Dependencies resolved.
============================================================================================ Package Arch Version Repository Size ============================================================================================ Installing: bash aarch64 4.4.20-4.el8_6 rhel-baseos 1.5 M bzip2 aarch64 1.0.6-26.el8 rhel-baseos 60 k coreutils aarch64 8.30-15.el8 rhel-baseos 1.2 M cpio aarch64 2.12-11.el8 rhel-baseos 260 k diffutils aarch64 3.6-6.el8 rhel-baseos 352 k epel-rpm-macros noarch 8-41 epel 27 k findutils aarch64 1:4.6.0-21.el8 rhel-baseos 524 k gawk aarch64 4.2.1-4.el8 rhel-baseos 1.1 M gcc aarch64 8.5.0-20.el8 rhel-appstream 19 M gcc-c++ aarch64 8.5.0-20.el8 rhel-appstream 11 M grep aarch64 3.1-6.el8 rhel-baseos 268 k gzip aarch64 1.9-13.el8_5 rhel-baseos 165 k info aarch64 6.5-7.el8 rhel-baseos 191 k make aarch64 1:4.2.1-11.el8 rhel-baseos 490 k patch aarch64 2.7.6-11.el8 rhel-baseos 134 k redhat-release aarch64 8.9-0.1.el8 rhel-baseos 45 k redhat-rpm-config noarch 131-1.el8 rhel-appstream 91 k rpm-build aarch64 4.14.3-28.el8_9 rhel-appstream 173 k sed aarch64 4.5-5.el8 rhel-baseos 295 k tar aarch64 2:1.30-9.el8 rhel-baseos 830 k unzip aarch64 6.0-46.el8 rhel-baseos 190 k util-linux aarch64 2.32.1-44.el8_9.1 rhel-baseos 2.5 M which aarch64 2.21-20.el8 rhel-baseos 49 k xz aarch64 5.2.4-4.el8_6 rhel-baseos 153 k Installing dependencies: annobin aarch64 11.13-2.el8 rhel-appstream 971 k ansible-srpm-macros noarch 1-12.el8 epel 21 k audit-libs aarch64 3.0.7-5.el8 rhel-baseos 119 k basesystem noarch 11-5.el8 rhel-baseos 11 k binutils aarch64 2.30-123.el8 rhel-baseos 6.1 M brotli aarch64 1.0.6-3.el8 rhel-baseos 314 k bzip2-libs aarch64 1.0.6-26.el8 rhel-baseos 48 k ca-certificates noarch 2023.2.60_v7.0.306-80.0.el8_8 rhel-baseos 935 k chkconfig aarch64 1.19.2-1.el8 rhel-baseos 197 k coreutils-common aarch64 8.30-15.el8 rhel-baseos 2.0 M cpp aarch64 8.5.0-20.el8 rhel-appstream 9.0 M cracklib aarch64 2.9.6-15.el8 rhel-baseos 93 k cracklib-dicts aarch64 2.9.6-15.el8 
rhel-baseos 4.0 M crypto-policies noarch 20230731-1.git3177e06.el8 rhel-baseos 64 k curl aarch64 7.61.1-33.el8_9.5 rhel-baseos 350 k cyrus-sasl-lib aarch64 2.1.27-6.el8_5 rhel-baseos 122 k dwz aarch64 0.12-10.el8 rhel-appstream 103 k efi-srpm-macros noarch 3-3.el8 rhel-appstream 22 k elfutils aarch64 0.189-3.el8 rhel-baseos 537 k elfutils-default-yama-scope noarch 0.189-3.el8 rhel-baseos 52 k elfutils-libelf aarch64 0.189-3.el8 rhel-baseos 231 k elfutils-libs aarch64 0.189-3.el8 rhel-baseos 292 k expat aarch64 2.2.5-11.el8_9.1 rhel-baseos 104 k file aarch64 5.33-25.el8 rhel-baseos 78 k file-libs aarch64 5.33-25.el8 rhel-baseos 541 k filesystem aarch64 3.8-6.el8 rhel-baseos 1.1 M fpc-srpm-macros noarch 1.3-1.el8 epel 8.2 k gc aarch64 7.6.4-3.el8 rhel-appstream 99 k gcc-plugin-annobin aarch64 8.5.0-20.el8 rhel-appstream 34 k gdb-headless aarch64 8.2-20.el8 rhel-appstream 3.1 M gdbm aarch64 1:1.18-2.el8 rhel-baseos 128 k gdbm-libs aarch64 1:1.18-2.el8 rhel-baseos 59 k ghc-srpm-macros noarch 1.4.2-7.el8 rhel-appstream 9.4 k glib2 aarch64 2.56.4-161.el8 rhel-baseos 2.4 M glibc aarch64 2.28-236.el8_9.12 rhel-baseos 1.8 M glibc-all-langpacks aarch64 2.28-236.el8_9.12 rhel-baseos 25 M glibc-common aarch64 2.28-236.el8_9.12 rhel-baseos 1.0 M glibc-devel aarch64 2.28-236.el8_9.12 rhel-baseos 84 k glibc-gconv-extra aarch64 2.28-236.el8_9.12 rhel-baseos 1.8 M glibc-headers aarch64 2.28-236.el8_9.12 rhel-baseos 482 k gmp aarch64 1:6.1.2-10.el8 rhel-baseos 270 k gnupg2 aarch64 2.2.20-3.el8_6 rhel-baseos 2.4 M gnutls aarch64 3.6.16-8.el8_9.3 rhel-baseos 940 k go-srpm-macros noarch 2-17.el8 rhel-appstream 13 k guile aarch64 5:2.0.14-7.el8 rhel-appstream 3.5 M ima-evm-utils aarch64 1.3.2-12.el8 rhel-baseos 63 k isl aarch64 0.16.1-6.el8 rhel-appstream 778 k kernel-headers aarch64 4.18.0-513.24.1.el8_9 rhel-baseos 11 M keyutils-libs aarch64 1.5.10-9.el8 rhel-baseos 34 k krb5-libs aarch64 1.18.2-26.el8_9 rhel-baseos 818 k libacl aarch64 2.2.53-1.el8 rhel-baseos 34 k libarchive aarch64 
3.3.3-5.el8 rhel-baseos 340 k libasan aarch64 8.5.0-20.el8 rhel-baseos 387 k libassuan aarch64 2.5.1-3.el8 rhel-baseos 81 k libatomic aarch64 8.5.0-20.el8 rhel-baseos 26 k libatomic_ops aarch64 7.6.2-3.el8 rhel-appstream 38 k libattr aarch64 2.4.48-3.el8 rhel-baseos 27 k libbabeltrace aarch64 1.5.4-4.el8 rhel-baseos 189 k libblkid aarch64 2.32.1-44.el8_9.1 rhel-baseos 215 k libcap aarch64 2.48-6.el8_9 rhel-baseos 74 k libcap-ng aarch64 0.7.11-1.el8 rhel-baseos 33 k libcom_err aarch64 1.45.6-5.el8 rhel-baseos 49 k libcurl aarch64 7.61.1-33.el8_9.5 rhel-baseos 286 k libdb aarch64 5.3.28-42.el8_4 rhel-baseos 687 k libdb-utils aarch64 5.3.28-42.el8_4 rhel-baseos 148 k libfdisk aarch64 2.32.1-44.el8_9.1 rhel-baseos 244 k libffi aarch64 3.1-24.el8 rhel-baseos 37 k libgcc aarch64 8.5.0-20.el8 rhel-baseos 75 k libgcrypt aarch64 1.8.5-7.el8_6 rhel-baseos 391 k libgomp aarch64 8.5.0-20.el8 rhel-baseos 200 k libgpg-error aarch64 1.31-1.el8 rhel-baseos 240 k libidn2 aarch64 2.2.0-1.el8 rhel-baseos 93 k libksba aarch64 1.3.5-9.el8_7 rhel-baseos 130 k libmount aarch64 2.32.1-44.el8_9.1 rhel-baseos 230 k libmpc aarch64 1.1.0-9.1.el8 rhel-appstream 60 k libnghttp2 aarch64 1.33.0-5.el8_9 rhel-baseos 75 k libnsl2 aarch64 1.2.0-2.20180605git4a062cf.el8 rhel-baseos 55 k libpkgconf aarch64 1.4.2-1.el8 rhel-baseos 34 k libpsl aarch64 0.20.2-6.el8 rhel-baseos 61 k libpwquality aarch64 1.4.4-6.el8 rhel-baseos 106 k libselinux aarch64 2.9-8.el8 rhel-baseos 162 k libsemanage aarch64 2.9-9.el8_6 rhel-baseos 164 k libsepol aarch64 2.9-3.el8 rhel-baseos 321 k libsigsegv aarch64 2.11-5.el8 rhel-baseos 30 k libsmartcols aarch64 2.32.1-44.el8_9.1 rhel-baseos 175 k libssh aarch64 0.9.6-13.el8_9 rhel-baseos 210 k libssh-config noarch 0.9.6-13.el8_9 rhel-baseos 21 k libstdc++ aarch64 8.5.0-20.el8 rhel-baseos 425 k libstdc++-devel aarch64 8.5.0-20.el8 rhel-appstream 2.1 M libtasn1 aarch64 4.13-4.el8_7 rhel-baseos 75 k libtirpc aarch64 1.1.4-8.el8 rhel-baseos 109 k libtool-ltdl aarch64 2.4.6-25.el8 
rhel-baseos 57 k libubsan aarch64 8.5.0-20.el8 rhel-baseos 145 k libunistring aarch64 0.9.9-3.el8 rhel-baseos 411 k libusbx aarch64 1.0.23-4.el8 rhel-baseos 73 k libutempter aarch64 1.1.6-14.el8 rhel-baseos 32 k libuuid aarch64 2.32.1-44.el8_9.1 rhel-baseos 98 k libverto aarch64 0.3.2-2.el8 rhel-baseos 24 k libxcrypt aarch64 4.1.1-6.el8 rhel-baseos 73 k libxcrypt-devel aarch64 4.1.1-6.el8 rhel-baseos 25 k libxml2 aarch64 2.9.7-18.el8_9 rhel-baseos 653 k libzstd aarch64 1.4.4-1.el8 rhel-baseos 240 k lua-libs aarch64 5.3.4-12.el8 rhel-baseos 112 k lua-srpm-macros noarch 1-13.el8 epel 9.2 k lz4-libs aarch64 1.8.3-3.el8_4 rhel-baseos 63 k mpfr aarch64 3.1.6-1.el8 rhel-baseos 214 k ncurses aarch64 6.1-10.20180224.el8 rhel-baseos 383 k ncurses-base noarch 6.1-10.20180224.el8 rhel-baseos 81 k ncurses-libs aarch64 6.1-10.20180224.el8 rhel-baseos 310 k nettle aarch64 3.4.1-7.el8 rhel-baseos 307 k npth aarch64 1.5-4.el8 rhel-baseos 26 k ocaml-srpm-macros noarch 5-4.el8 rhel-appstream 9.5 k openblas-srpm-macros noarch 2-2.el8 rhel-appstream 8.0 k openldap aarch64 2.4.46-18.el8 rhel-baseos 339 k openssl-libs aarch64 1:1.1.1k-12.el8_9 rhel-baseos 1.3 M p11-kit aarch64 0.23.22-1.el8 rhel-baseos 306 k p11-kit-trust aarch64 0.23.22-1.el8 rhel-baseos 134 k pam aarch64 1.3.1-27.el8 rhel-baseos 740 k pcre aarch64 8.42-6.el8 rhel-baseos 187 k pcre2 aarch64 10.32-3.el8_6 rhel-baseos 219 k perl-srpm-macros noarch 1-25.el8 rhel-appstream 11 k pkgconf aarch64 1.4.2-1.el8 rhel-baseos 37 k pkgconf-m4 noarch 1.4.2-1.el8 rhel-baseos 17 k pkgconf-pkg-config aarch64 1.4.2-1.el8 rhel-baseos 15 k platform-python aarch64 3.6.8-56.el8_9.3 rhel-baseos 87 k platform-python-setuptools noarch 39.2.0-7.el8 rhel-baseos 632 k popt aarch64 1.18-1.el8 rhel-baseos 60 k publicsuffix-list-dafsa noarch 20180723-1.el8 rhel-baseos 56 k python-rpm-macros noarch 3-45.el8 rhel-appstream 16 k python-srpm-macros noarch 3-45.el8 rhel-appstream 16 k python3-libs aarch64 3.6.8-56.el8_9.3 rhel-baseos 7.7 M 
python3-pip-wheel noarch 9.0.3-23.el8_9.1 rhel-baseos 866 k python3-rpm-macros noarch 3-45.el8 rhel-appstream 15 k python3-setuptools-wheel noarch 39.2.0-7.el8 rhel-baseos 289 k qt5-srpm-macros noarch 5.15.3-1.el8 rhel-appstream 11 k readline aarch64 7.0-10.el8 rhel-baseos 193 k rpm aarch64 4.14.3-28.el8_9 rhel-baseos 544 k rpm-build-libs aarch64 4.14.3-28.el8_9 rhel-baseos 151 k rpm-libs aarch64 4.14.3-28.el8_9 rhel-baseos 330 k rust-srpm-macros noarch 5-2.el8 rhel-appstream 9.3 k setup noarch 2.12.2-9.el8 rhel-baseos 181 k shadow-utils aarch64 2:4.6-19.el8 rhel-baseos 1.2 M sqlite-libs aarch64 3.26.0-19.el8_9 rhel-baseos 551 k systemd-libs aarch64 239-78.el8 rhel-baseos 1.0 M tpm2-tss aarch64 2.3.2-5.el8 rhel-baseos 240 k tzdata noarch 2024a-1.el8 rhel-baseos 475 k xz-libs aarch64 5.2.4-4.el8_6 rhel-baseos 91 k zip aarch64 3.0-23.el8 rhel-baseos 265 k zlib aarch64 1.2.11-25.el8 rhel-baseos 101 k zstd aarch64 1.4.4-1.el8 rhel-appstream 303 k Transaction Summary ============================================================================================ Install 174 Packages Total download size: 155 M Installed size: 825 M Downloading Packages: (1/174): bzip2-libs-1.0.6-26.el8.aarch64.rpm 441 kB/s | 48 kB 00:00 (2/174): bzip2-1.0.6-26.el8.aarch64.rpm 511 kB/s | 60 kB 00:00 (3/174): cracklib-2.9.6-15.el8.aarch64.rpm 745 kB/s | 93 kB 00:00 (4/174): grep-3.1-6.el8.aarch64.rpm 2.9 MB/s | 268 kB 00:00 (5/174): cracklib-dicts-2.9.6-15.el8.aarch64.rp 31 MB/s | 4.0 MB 00:00 (6/174): libassuan-2.5.1-3.el8.aarch64.rpm 1.4 MB/s | 81 kB 00:00 (7/174): libacl-2.2.53-1.el8.aarch64.rpm 211 kB/s | 34 kB 00:00 (8/174): libattr-2.4.48-3.el8.aarch64.rpm 510 kB/s | 27 kB 00:00 (9/174): libgpg-error-1.31-1.el8.aarch64.rpm 2.4 MB/s | 240 kB 00:00 (10/174): libpkgconf-1.4.2-1.el8.aarch64.rpm 425 kB/s | 34 kB 00:00 (11/174): libnsl2-1.2.0-2.20180605git4a062cf.el 576 kB/s | 55 kB 00:00 (12/174): libsigsegv-2.11-5.el8.aarch64.rpm 571 kB/s | 30 kB 00:00 (13/174): 
libutempter-1.1.6-14.el8.aarch64.rpm 654 kB/s | 32 kB 00:00 (14/174): libunistring-0.9.9-3.el8.aarch64.rpm 4.7 MB/s | 411 kB 00:00 (15/174): libtool-ltdl-2.4.6-25.el8.aarch64.rpm 590 kB/s | 57 kB 00:00 (16/174): mpfr-3.1.6-1.el8.aarch64.rpm 4.0 MB/s | 214 kB 00:00 (17/174): pkgconf-1.4.2-1.el8.aarch64.rpm 784 kB/s | 37 kB 00:00 (18/174): npth-1.5-4.el8.aarch64.rpm 334 kB/s | 26 kB 00:00 (19/174): pkgconf-pkg-config-1.4.2-1.el8.aarch6 319 kB/s | 15 kB 00:00 (20/174): readline-7.0-10.el8.aarch64.rpm 3.5 MB/s | 193 kB 00:00 (21/174): zip-3.0-23.el8.aarch64.rpm 4.8 MB/s | 265 kB 00:00 (22/174): pkgconf-m4-1.4.2-1.el8.noarch.rpm 318 kB/s | 17 kB 00:00 (23/174): basesystem-11-5.el8.noarch.rpm 108 kB/s | 11 kB 00:00 (24/174): gmp-6.1.2-10.el8.aarch64.rpm 4.7 MB/s | 270 kB 00:00 (25/174): publicsuffix-list-dafsa-20180723-1.el 560 kB/s | 56 kB 00:00 (26/174): libidn2-2.2.0-1.el8.aarch64.rpm 1.4 MB/s | 93 kB 00:00 (27/174): patch-2.7.6-11.el8.aarch64.rpm 2.4 MB/s | 134 kB 00:00 (28/174): diffutils-3.6-6.el8.aarch64.rpm 4.4 MB/s | 352 kB 00:00 (29/174): libpsl-0.20.2-6.el8.aarch64.rpm 899 kB/s | 61 kB 00:00 (30/174): libusbx-1.0.23-4.el8.aarch64.rpm 942 kB/s | 73 kB 00:00 (31/174): brotli-1.0.6-3.el8.aarch64.rpm 4.9 MB/s | 314 kB 00:00 (32/174): libzstd-1.4.4-1.el8.aarch64.rpm 2.3 MB/s | 240 kB 00:00 (33/174): ima-evm-utils-1.3.2-12.el8.aarch64.rp 1.2 MB/s | 63 kB 00:00 (34/174): popt-1.18-1.el8.aarch64.rpm 1.1 MB/s | 60 kB 00:00 (35/174): p11-kit-trust-0.23.22-1.el8.aarch64.r 1.1 MB/s | 134 kB 00:00 (36/174): libdb-utils-5.3.28-42.el8_4.aarch64.r 2.7 MB/s | 148 kB 00:00 (37/174): libdb-5.3.28-42.el8_4.aarch64.rpm 5.4 MB/s | 687 kB 00:00 (38/174): libsepol-2.9-3.el8.aarch64.rpm 5.9 MB/s | 321 kB 00:00 (39/174): lz4-libs-1.8.3-3.el8_4.aarch64.rpm 1.1 MB/s | 63 kB 00:00 (40/174): nettle-3.4.1-7.el8.aarch64.rpm 4.9 MB/s | 307 kB 00:00 (41/174): openldap-2.4.46-18.el8.aarch64.rpm 6.3 MB/s | 339 kB 00:00 (42/174): p11-kit-0.23.22-1.el8.aarch64.rpm 4.9 MB/s | 306 kB 00:00 (43/174): 
pcre-8.42-6.el8.aarch64.rpm 3.6 MB/s | 187 kB 00:00 (44/174): cyrus-sasl-lib-2.1.27-6.el8_5.aarch64 2.4 MB/s | 122 kB 00:00 (45/174): filesystem-3.8-6.el8.aarch64.rpm 18 MB/s | 1.1 MB 00:00 (46/174): gzip-1.9-13.el8_5.aarch64.rpm 3.0 MB/s | 165 kB 00:00 (47/174): libcap-ng-0.7.11-1.el8.aarch64.rpm 496 kB/s | 33 kB 00:00 (48/174): keyutils-libs-1.5.10-9.el8.aarch64.rp 373 kB/s | 34 kB 00:00 (49/174): libxcrypt-4.1.1-6.el8.aarch64.rpm 1.4 MB/s | 73 kB 00:00 (50/174): libxcrypt-devel-4.1.1-6.el8.aarch64.r 508 kB/s | 25 kB 00:00 (51/174): make-4.2.1-11.el8.aarch64.rpm 5.5 MB/s | 490 kB 00:00 (52/174): lua-libs-5.3.4-12.el8.aarch64.rpm 1.1 MB/s | 112 kB 00:00 (53/174): info-6.5-7.el8.aarch64.rpm 3.3 MB/s | 191 kB 00:00 (54/174): gawk-4.2.1-4.el8.aarch64.rpm 14 MB/s | 1.1 MB 00:00 (55/174): cpio-2.12-11.el8.aarch64.rpm 1.7 MB/s | 260 kB 00:00 (56/174): sed-4.5-5.el8.aarch64.rpm 2.6 MB/s | 295 kB 00:00 (57/174): xz-5.2.4-4.el8_6.aarch64.rpm 2.1 MB/s | 153 kB 00:00 (58/174): unzip-6.0-46.el8.aarch64.rpm 1.5 MB/s | 190 kB 00:00 (59/174): xz-libs-5.2.4-4.el8_6.aarch64.rpm 1.3 MB/s | 91 kB 00:00 (60/174): bash-4.4.20-4.el8_6.aarch64.rpm 22 MB/s | 1.5 MB 00:00 (61/174): gdbm-1.18-2.el8.aarch64.rpm 1.4 MB/s | 128 kB 00:00 (62/174): gnupg2-2.2.20-3.el8_6.aarch64.rpm 33 MB/s | 2.4 MB 00:00 (63/174): libcom_err-1.45.6-5.el8.aarch64.rpm 1.0 MB/s | 49 kB 00:00 (64/174): gdbm-libs-1.18-2.el8.aarch64.rpm 358 kB/s | 59 kB 00:00 (65/174): libgcrypt-1.8.5-7.el8_6.aarch64.rpm 5.2 MB/s | 391 kB 00:00 (66/174): libbabeltrace-1.5.4-4.el8.aarch64.rpm 1.2 MB/s | 189 kB 00:00 (67/174): libsemanage-2.9-9.el8_6.aarch64.rpm 2.9 MB/s | 164 kB 00:00 (68/174): libtirpc-1.1.4-8.el8.aarch64.rpm 2.2 MB/s | 109 kB 00:00 (69/174): libverto-0.3.2-2.el8.aarch64.rpm 376 kB/s | 24 kB 00:00 (70/174): pcre2-10.32-3.el8_6.aarch64.rpm 3.1 MB/s | 219 kB 00:00 (71/174): coreutils-8.30-15.el8.aarch64.rpm 17 MB/s | 1.2 MB 00:00 (72/174): glib2-2.56.4-161.el8.aarch64.rpm 30 MB/s | 2.4 MB 00:00 (73/174): 
coreutils-common-8.30-15.el8.aarch64. 19 MB/s | 2.0 MB 00:00 (74/174): libffi-3.1-24.el8.aarch64.rpm 569 kB/s | 37 kB 00:00 (75/174): libksba-1.3.5-9.el8_7.aarch64.rpm 2.5 MB/s | 130 kB 00:00 (76/174): libselinux-2.9-8.el8.aarch64.rpm 2.8 MB/s | 162 kB 00:00 (77/174): libtasn1-4.13-4.el8_7.aarch64.rpm 1.1 MB/s | 75 kB 00:00 (78/174): libpwquality-1.4.4-6.el8.aarch64.rpm 832 kB/s | 106 kB 00:00 (79/174): setup-2.12.2-9.el8.noarch.rpm 3.4 MB/s | 181 kB 00:00 (80/174): tar-1.30-9.el8.aarch64.rpm 11 MB/s | 830 kB 00:00 (81/174): platform-python-setuptools-39.2.0-7.e 3.2 MB/s | 632 kB 00:00 (82/174): python3-setuptools-wheel-39.2.0-7.el8 1.5 MB/s | 289 kB 00:00 (83/174): ca-certificates-2023.2.60_v7.0.306-80 14 MB/s | 935 kB 00:00 (84/174): audit-libs-3.0.7-5.el8.aarch64.rpm 1.4 MB/s | 119 kB 00:00 (85/174): chkconfig-1.19.2-1.el8.aarch64.rpm 2.9 MB/s | 197 kB 00:00 (86/174): elfutils-0.189-3.el8.aarch64.rpm 9.0 MB/s | 537 kB 00:00 (87/174): crypto-policies-20230731-1.git3177e06 927 kB/s | 64 kB 00:00 (88/174): file-5.33-25.el8.aarch64.rpm 1.3 MB/s | 78 kB 00:00 (89/174): file-libs-5.33-25.el8.aarch64.rpm 9.7 MB/s | 541 kB 00:00 (90/174): elfutils-libs-0.189-3.el8.aarch64.rpm 2.0 MB/s | 292 kB 00:00 (91/174): libgomp-8.5.0-20.el8.aarch64.rpm 1.7 MB/s | 200 kB 00:00 (92/174): libarchive-3.3.3-5.el8.aarch64.rpm 2.4 MB/s | 340 kB 00:00 (93/174): ncurses-6.1-10.20180224.el8.aarch64.r 5.3 MB/s | 383 kB 00:00 (94/174): tpm2-tss-2.3.2-5.el8.aarch64.rpm 4.0 MB/s | 240 kB 00:00 (95/174): ncurses-libs-6.1-10.20180224.el8.aarc 2.6 MB/s | 310 kB 00:00 (96/174): libnghttp2-1.33.0-5.el8_9.aarch64.rpm 342 kB/s | 75 kB 00:00 (97/174): which-2.21-20.el8.aarch64.rpm 941 kB/s | 49 kB 00:00 (98/174): zlib-1.2.11-25.el8.aarch64.rpm 1.7 MB/s | 101 kB 00:00 (99/174): elfutils-libelf-0.189-3.el8.aarch64.r 4.6 MB/s | 231 kB 00:00 (100/174): binutils-2.30-123.el8.aarch64.rpm 66 MB/s | 6.1 MB 00:00 (101/174): elfutils-default-yama-scope-0.189-3. 
499 kB/s | 52 kB 00:00 (102/174): krb5-libs-1.18.2-26.el8_9.aarch64.rp 15 MB/s | 818 kB 00:00 (103/174): libasan-8.5.0-20.el8.aarch64.rpm 5.7 MB/s | 387 kB 00:00 (104/174): findutils-4.6.0-21.el8.aarch64.rpm 4.5 MB/s | 524 kB 00:00 (105/174): libcap-2.48-6.el8_9.aarch64.rpm 907 kB/s | 74 kB 00:00 (106/174): libgcc-8.5.0-20.el8.aarch64.rpm 942 kB/s | 75 kB 00:00 (107/174): libatomic-8.5.0-20.el8.aarch64.rpm 142 kB/s | 26 kB 00:00 (108/174): libubsan-8.5.0-20.el8.aarch64.rpm 1.8 MB/s | 145 kB 00:00 (109/174): libstdc++-8.5.0-20.el8.aarch64.rpm 3.8 MB/s | 425 kB 00:00 (110/174): ncurses-base-6.1-10.20180224.el8.noa 1.5 MB/s | 81 kB 00:00 (111/174): openssl-libs-1.1.1k-12.el8_9.aarch64 18 MB/s | 1.3 MB 00:00 (112/174): pam-1.3.1-27.el8.aarch64.rpm 14 MB/s | 740 kB 00:00 (113/174): libxml2-2.9.7-18.el8_9.aarch64.rpm 4.1 MB/s | 653 kB 00:00 (114/174): platform-python-3.6.8-56.el8_9.3.aar 1.5 MB/s | 87 kB 00:00 (115/174): redhat-release-8.9-0.1.el8.aarch64.r 893 kB/s | 45 kB 00:00 (116/174): python3-libs-3.6.8-56.el8_9.3.aarch6 78 MB/s | 7.7 MB 00:00 (117/174): sqlite-libs-3.26.0-19.el8_9.aarch64. 8.7 MB/s | 551 kB 00:00 (118/174): shadow-utils-4.6-19.el8.aarch64.rpm 13 MB/s | 1.2 MB 00:00 (119/174): libssh-config-0.9.6-13.el8_9.noarch. 
338 kB/s | 21 kB 00:00 (120/174): libssh-0.9.6-13.el8_9.aarch64.rpm 2.3 MB/s | 210 kB 00:00 (121/174): systemd-libs-239-78.el8.aarch64.rpm 6.7 MB/s | 1.0 MB 00:00 (122/174): rpm-4.14.3-28.el8_9.aarch64.rpm 6.7 MB/s | 544 kB 00:00 (123/174): rpm-libs-4.14.3-28.el8_9.aarch64.rpm 6.6 MB/s | 330 kB 00:00 (124/174): rpm-build-libs-4.14.3-28.el8_9.aarch 1.9 MB/s | 151 kB 00:00 (125/174): glibc-2.28-236.el8_9.12.aarch64.rpm 31 MB/s | 1.8 MB 00:00 (126/174): glibc-common-2.28-236.el8_9.12.aarch 8.2 MB/s | 1.0 MB 00:00 (127/174): glibc-all-langpacks-2.28-236.el8_9.1 124 MB/s | 25 MB 00:00 (128/174): glibc-devel-2.28-236.el8_9.12.aarch6 1.4 MB/s | 84 kB 00:00 (129/174): tzdata-2024a-1.el8.noarch.rpm 1.7 MB/s | 475 kB 00:00 (130/174): glibc-gconv-extra-2.28-236.el8_9.12. 22 MB/s | 1.8 MB 00:00 (131/174): glibc-headers-2.28-236.el8_9.12.aarc 6.9 MB/s | 482 kB 00:00 (132/174): curl-7.61.1-33.el8_9.5.aarch64.rpm 5.7 MB/s | 350 kB 00:00 (133/174): libblkid-2.32.1-44.el8_9.1.aarch64.r 3.9 MB/s | 215 kB 00:00 (134/174): libcurl-7.61.1-33.el8_9.5.aarch64.rp 3.5 MB/s | 286 kB 00:00 (135/174): kernel-headers-4.18.0-513.24.1.el8_9 73 MB/s | 11 MB 00:00 (136/174): libmount-2.32.1-44.el8_9.1.aarch64.r 3.2 MB/s | 230 kB 00:00 (137/174): libuuid-2.32.1-44.el8_9.1.aarch64.rp 1.6 MB/s | 98 kB 00:00 (138/174): libsmartcols-2.32.1-44.el8_9.1.aarch 1.3 MB/s | 175 kB 00:00 (139/174): python3-pip-wheel-9.0.3-23.el8_9.1.n 14 MB/s | 866 kB 00:00 (140/174): libfdisk-2.32.1-44.el8_9.1.aarch64.r 1.0 MB/s | 244 kB 00:00 (141/174): util-linux-2.32.1-44.el8_9.1.aarch64 30 MB/s | 2.5 MB 00:00 (142/174): gnutls-3.6.16-8.el8_9.3.aarch64.rpm 15 MB/s | 940 kB 00:00 (143/174): expat-2.2.5-11.el8_9.1.aarch64.rpm 967 kB/s | 104 kB 00:00 (144/174): guile-2.0.14-7.el8.aarch64.rpm 41 MB/s | 3.5 MB 00:00 (145/174): libatomic_ops-7.6.2-3.el8.aarch64.rp 486 kB/s | 38 kB 00:00 (146/174): gc-7.6.4-3.el8.aarch64.rpm 1.3 MB/s | 99 kB 00:00 (147/174): rust-srpm-macros-5-2.el8.noarch.rpm 139 kB/s | 9.3 kB 00:00 (148/174): 
ghc-srpm-macros-1.4.2-7.el8.noarch.r 181 kB/s | 9.4 kB 00:00 (149/174): isl-0.16.1-6.el8.aarch64.rpm 3.1 MB/s | 778 kB 00:00 (150/174): openblas-srpm-macros-2-2.el8.noarch. 110 kB/s | 8.0 kB 00:00 (151/174): ocaml-srpm-macros-5-4.el8.noarch.rpm 98 kB/s | 9.5 kB 00:00 (152/174): perl-srpm-macros-1-25.el8.noarch.rpm 141 kB/s | 11 kB 00:00 (153/174): efi-srpm-macros-3-3.el8.noarch.rpm 360 kB/s | 22 kB 00:00 (154/174): zstd-1.4.4-1.el8.aarch64.rpm 4.6 MB/s | 303 kB 00:00 (155/174): libmpc-1.1.0-9.1.el8.aarch64.rpm 1.2 MB/s | 60 kB 00:00 (156/174): go-srpm-macros-2-17.el8.noarch.rpm 196 kB/s | 13 kB 00:00 (157/174): dwz-0.12-10.el8.aarch64.rpm 1.6 MB/s | 103 kB 00:00 (158/174): qt5-srpm-macros-5.15.3-1.el8.noarch. 185 kB/s | 11 kB 00:00 (159/174): python-rpm-macros-3-45.el8.noarch.rp 291 kB/s | 16 kB 00:00 (160/174): redhat-rpm-config-131-1.el8.noarch.r 1.6 MB/s | 91 kB 00:00 (161/174): python-srpm-macros-3-45.el8.noarch.r 316 kB/s | 16 kB 00:00 (162/174): python3-rpm-macros-3-45.el8.noarch.r 247 kB/s | 15 kB 00:00 (163/174): annobin-11.13-2.el8.aarch64.rpm 14 MB/s | 971 kB 00:00 (164/174): cpp-8.5.0-20.el8.aarch64.rpm 79 MB/s | 9.0 MB 00:00 (165/174): gcc-plugin-annobin-8.5.0-20.el8.aarc 478 kB/s | 34 kB 00:00 (166/174): libstdc++-devel-8.5.0-20.el8.aarch64 20 MB/s | 2.1 MB 00:00 (167/174): gcc-c++-8.5.0-20.el8.aarch64.rpm 68 MB/s | 11 MB 00:00 (168/174): gdb-headless-8.2-20.el8.aarch64.rpm 39 MB/s | 3.1 MB 00:00 (169/174): ansible-srpm-macros-1-12.el8.noarch. 
1.6 MB/s | 21 kB 00:00 (170/174): epel-rpm-macros-8-41.noarch.rpm 7.1 MB/s | 27 kB 00:00 (171/174): fpc-srpm-macros-1.3-1.el8.noarch.rpm 2.5 MB/s | 8.2 kB 00:00 (172/174): lua-srpm-macros-1-13.el8.noarch.rpm 3.6 MB/s | 9.2 kB 00:00 (173/174): gcc-8.5.0-20.el8.aarch64.rpm 59 MB/s | 19 MB 00:00 (174/174): rpm-build-4.14.3-28.el8_9.aarch64.rp 1.3 MB/s | 173 kB 00:00 -------------------------------------------------------------------------------- Total 30 MB/s | 155 MB 00:05 Red Hat Enterprise Linux - BaseOS 3.1 MB/s | 3.1 kB 00:00 Importing GPG key 0xFD431D51: Userid : "Red Hat, Inc. (release key 2) " Fingerprint: 567E 347A D004 4ADE 55BA 8A5F 199E 2F91 FD43 1D51 From : /usr/share/distribution-gpg-keys/redhat/RPM-GPG-KEY-redhat8-release Key imported successfully Importing GPG key 0x2FA658E0: Userid : "Red Hat, Inc. (auxiliary key) " Fingerprint: 43A6 E49C 4A38 F4BE 9ABF 2A53 4568 9C88 2FA6 58E0 From : /usr/share/distribution-gpg-keys/redhat/RPM-GPG-KEY-redhat8-release Key imported successfully Extra Packages for Enterprise Linux 8 - aarch64 1.6 MB/s | 1.6 kB 00:00 Importing GPG key 0x2F86D6A1: Userid : "Fedora EPEL (8) " Fingerprint: 94E2 79EB 8D8F 25B2 1810 ADF1 21EA 45AB 2F86 D6A1 From : /usr/share/distribution-gpg-keys/epel/RPM-GPG-KEY-EPEL-8 Key imported successfully Running transaction check Transaction check succeeded. Running transaction test Transaction test succeeded. 
Running transaction Running scriptlet: filesystem-3.8-6.el8.aarch64 1/1 Preparing : 1/1 Installing : libgcc-8.5.0-20.el8.aarch64 1/174 Running scriptlet: libgcc-8.5.0-20.el8.aarch64 1/174 Installing : python-srpm-macros-3-45.el8.noarch 2/174 Installing : crypto-policies-20230731-1.git3177e06.el8.noarch 3/174 Running scriptlet: crypto-policies-20230731-1.git3177e06.el8.noarch 3/174 Installing : python-rpm-macros-3-45.el8.noarch 4/174 Installing : python3-pip-wheel-9.0.3-23.el8_9.1.noarch 5/174 Installing : redhat-release-8.9-0.1.el8.aarch64 6/174 Installing : setup-2.12.2-9.el8.noarch 7/174 warning: /etc/hosts created as /etc/hosts.rpmnew Running scriptlet: setup-2.12.2-9.el8.noarch 7/174 Installing : filesystem-3.8-6.el8.aarch64 8/174 Installing : python3-setuptools-wheel-39.2.0-7.el8.noarch 9/174 Installing : basesystem-11-5.el8.noarch 10/174 Installing : python3-rpm-macros-3-45.el8.noarch 11/174 Installing : fpc-srpm-macros-1.3-1.el8.noarch 12/174 Installing : ansible-srpm-macros-1-12.el8.noarch 13/174 Installing : qt5-srpm-macros-5.15.3-1.el8.noarch 14/174 Installing : go-srpm-macros-2-17.el8.noarch 15/174 Installing : perl-srpm-macros-1-25.el8.noarch 16/174 Installing : openblas-srpm-macros-2-2.el8.noarch 17/174 Installing : ocaml-srpm-macros-5-4.el8.noarch 18/174 Installing : ghc-srpm-macros-1.4.2-7.el8.noarch 19/174 Installing : rust-srpm-macros-5-2.el8.noarch 20/174 Installing : kernel-headers-4.18.0-513.24.1.el8_9.aarch64 21/174 Installing : tzdata-2024a-1.el8.noarch 22/174 Installing : libssh-config-0.9.6-13.el8_9.noarch 23/174 Installing : ncurses-base-6.1-10.20180224.el8.noarch 24/174 Installing : pcre2-10.32-3.el8_6.aarch64 25/174 Installing : libselinux-2.9-8.el8.aarch64 26/174 Installing : ncurses-libs-6.1-10.20180224.el8.aarch64 27/174 Installing : glibc-all-langpacks-2.28-236.el8_9.12.aarch64 28/174 Installing : glibc-common-2.28-236.el8_9.12.aarch64 29/174 Installing : glibc-gconv-extra-2.28-236.el8_9.12.aarch64 30/174 Running scriptlet: 
glibc-gconv-extra-2.28-236.el8_9.12.aarch64 30/174 Running scriptlet: glibc-2.28-236.el8_9.12.aarch64 31/174 Installing : glibc-2.28-236.el8_9.12.aarch64 31/174 Running scriptlet: glibc-2.28-236.el8_9.12.aarch64 31/174 Installing : bash-4.4.20-4.el8_6.aarch64 32/174 Running scriptlet: bash-4.4.20-4.el8_6.aarch64 32/174 Installing : libsepol-2.9-3.el8.aarch64 33/174 Running scriptlet: libsepol-2.9-3.el8.aarch64 33/174 Installing : zlib-1.2.11-25.el8.aarch64 34/174 Installing : info-6.5-7.el8.aarch64 35/174 Installing : bzip2-libs-1.0.6-26.el8.aarch64 36/174 Installing : xz-libs-5.2.4-4.el8_6.aarch64 37/174 Installing : gmp-1:6.1.2-10.el8.aarch64 38/174 Running scriptlet: gmp-1:6.1.2-10.el8.aarch64 38/174 Installing : libstdc++-8.5.0-20.el8.aarch64 39/174 Running scriptlet: libstdc++-8.5.0-20.el8.aarch64 39/174 Installing : libzstd-1.4.4-1.el8.aarch64 40/174 Installing : elfutils-libelf-0.189-3.el8.aarch64 41/174 Installing : libxcrypt-4.1.1-6.el8.aarch64 42/174 Installing : mpfr-3.1.6-1.el8.aarch64 43/174 Running scriptlet: mpfr-3.1.6-1.el8.aarch64 43/174 Installing : readline-7.0-10.el8.aarch64 44/174 Running scriptlet: readline-7.0-10.el8.aarch64 44/174 Installing : sqlite-libs-3.26.0-19.el8_9.aarch64 45/174 Installing : popt-1.18-1.el8.aarch64 46/174 Installing : libcap-2.48-6.el8_9.aarch64 47/174 Installing : libcom_err-1.45.6-5.el8.aarch64 48/174 Running scriptlet: libcom_err-1.45.6-5.el8.aarch64 48/174 Installing : libuuid-2.32.1-44.el8_9.1.aarch64 49/174 Running scriptlet: libuuid-2.32.1-44.el8_9.1.aarch64 49/174 Installing : chkconfig-1.19.2-1.el8.aarch64 50/174 Installing : libunistring-0.9.9-3.el8.aarch64 51/174 Installing : libattr-2.4.48-3.el8.aarch64 52/174 Installing : libacl-2.2.53-1.el8.aarch64 53/174 Installing : sed-4.5-5.el8.aarch64 54/174 Running scriptlet: sed-4.5-5.el8.aarch64 54/174 Installing : libgpg-error-1.31-1.el8.aarch64 55/174 Installing : lua-libs-5.3.4-12.el8.aarch64 56/174 Installing : libffi-3.1-24.el8.aarch64 57/174 Installing : 
p11-kit-0.23.22-1.el8.aarch64 58/174 Installing : libidn2-2.2.0-1.el8.aarch64 59/174 Installing : libmpc-1.1.0-9.1.el8.aarch64 60/174 Installing : file-libs-5.33-25.el8.aarch64 61/174 Installing : file-5.33-25.el8.aarch64 62/174 Installing : libgcrypt-1.8.5-7.el8_6.aarch64 63/174 Running scriptlet: libgcrypt-1.8.5-7.el8_6.aarch64 63/174 Installing : unzip-6.0-46.el8.aarch64 64/174 Installing : findutils-1:4.6.0-21.el8.aarch64 65/174 Running scriptlet: findutils-1:4.6.0-21.el8.aarch64 65/174 Installing : elfutils-default-yama-scope-0.189-3.el8.noarch 66/174 Running scriptlet: elfutils-default-yama-scope-0.189-3.el8.noarch 66/174 Installing : elfutils-libs-0.189-3.el8.aarch64 67/174 Running scriptlet: glibc-headers-2.28-236.el8_9.12.aarch64 68/174 Installing : glibc-headers-2.28-236.el8_9.12.aarch64 68/174 Installing : lz4-libs-1.8.3-3.el8_4.aarch64 69/174 Installing : pcre-8.42-6.el8.aarch64 70/174 Installing : grep-3.1-6.el8.aarch64 71/174 Running scriptlet: grep-3.1-6.el8.aarch64 71/174 Installing : keyutils-libs-1.5.10-9.el8.aarch64 72/174 Installing : libcap-ng-0.7.11-1.el8.aarch64 73/174 Installing : audit-libs-3.0.7-5.el8.aarch64 74/174 Installing : gdbm-libs-1:1.18-2.el8.aarch64 75/174 Installing : libtasn1-4.13-4.el8_7.aarch64 76/174 Running scriptlet: libtasn1-4.13-4.el8_7.aarch64 76/174 Installing : p11-kit-trust-0.23.22-1.el8.aarch64 77/174 Running scriptlet: p11-kit-trust-0.23.22-1.el8.aarch64 77/174 Installing : expat-2.2.5-11.el8_9.1.aarch64 78/174 Installing : gdbm-1:1.18-2.el8.aarch64 79/174 Installing : libsemanage-2.9-9.el8_6.aarch64 80/174 Installing : xz-5.2.4-4.el8_6.aarch64 81/174 Installing : elfutils-0.189-3.el8.aarch64 82/174 Installing : zip-3.0-23.el8.aarch64 83/174 Installing : cpp-8.5.0-20.el8.aarch64 84/174 Running scriptlet: cpp-8.5.0-20.el8.aarch64 84/174 Installing : libassuan-2.5.1-3.el8.aarch64 85/174 Installing : libksba-1.3.5-9.el8_7.aarch64 86/174 Installing : tar-2:1.30-9.el8.aarch64 87/174 Running scriptlet: 
tar-2:1.30-9.el8.aarch64 87/174 Installing : patch-2.7.6-11.el8.aarch64 88/174 Installing : dwz-0.12-10.el8.aarch64 89/174 Installing : libasan-8.5.0-20.el8.aarch64 90/174 Running scriptlet: libasan-8.5.0-20.el8.aarch64 90/174 Installing : libubsan-8.5.0-20.el8.aarch64 91/174 Running scriptlet: libubsan-8.5.0-20.el8.aarch64 91/174 Installing : libstdc++-devel-8.5.0-20.el8.aarch64 92/174 Installing : nettle-3.4.1-7.el8.aarch64 93/174 Running scriptlet: nettle-3.4.1-7.el8.aarch64 93/174 Installing : gnutls-3.6.16-8.el8_9.3.aarch64 94/174 Installing : isl-0.16.1-6.el8.aarch64 95/174 Running scriptlet: isl-0.16.1-6.el8.aarch64 95/174 Installing : libxml2-2.9.7-18.el8_9.aarch64 96/174 Installing : bzip2-1.0.6-26.el8.aarch64 97/174 Installing : diffutils-3.6-6.el8.aarch64 98/174 Running scriptlet: diffutils-3.6-6.el8.aarch64 98/174 Installing : coreutils-common-8.30-15.el8.aarch64 99/174 Running scriptlet: coreutils-common-8.30-15.el8.aarch64 99/174 Installing : libgomp-8.5.0-20.el8.aarch64 100/174 Running scriptlet: libgomp-8.5.0-20.el8.aarch64 100/174 Installing : libatomic-8.5.0-20.el8.aarch64 101/174 Running scriptlet: libatomic-8.5.0-20.el8.aarch64 101/174 Installing : zstd-1.4.4-1.el8.aarch64 102/174 Installing : libpkgconf-1.4.2-1.el8.aarch64 103/174 Installing : pkgconf-1.4.2-1.el8.aarch64 104/174 Installing : libsigsegv-2.11-5.el8.aarch64 105/174 Installing : gawk-4.2.1-4.el8.aarch64 106/174 Installing : libtool-ltdl-2.4.6-25.el8.aarch64 107/174 Running scriptlet: libtool-ltdl-2.4.6-25.el8.aarch64 107/174 Installing : npth-1.5-4.el8.aarch64 108/174 Installing : brotli-1.0.6-3.el8.aarch64 109/174 Installing : cpio-2.12-11.el8.aarch64 110/174 Installing : libverto-0.3.2-2.el8.aarch64 111/174 Installing : libnghttp2-1.33.0-5.el8_9.aarch64 112/174 Installing : ncurses-6.1-10.20180224.el8.aarch64 113/174 Installing : openssl-libs-1:1.1.1k-12.el8_9.aarch64 114/174 Running scriptlet: openssl-libs-1:1.1.1k-12.el8_9.aarch64 114/174 Installing : 
coreutils-8.30-15.el8.aarch64 115/174 Running scriptlet: ca-certificates-2023.2.60_v7.0.306-80.0.el8_8.no 116/174 Installing : ca-certificates-2023.2.60_v7.0.306-80.0.el8_8.no 116/174 Running scriptlet: ca-certificates-2023.2.60_v7.0.306-80.0.el8_8.no 116/174 Installing : libdb-5.3.28-42.el8_4.aarch64 117/174 Running scriptlet: libdb-5.3.28-42.el8_4.aarch64 117/174 Installing : krb5-libs-1.18.2-26.el8_9.aarch64 118/174 Installing : libtirpc-1.1.4-8.el8.aarch64 119/174 Running scriptlet: libtirpc-1.1.4-8.el8.aarch64 119/174 Installing : libblkid-2.32.1-44.el8_9.1.aarch64 120/174 Running scriptlet: libblkid-2.32.1-44.el8_9.1.aarch64 120/174 Installing : libmount-2.32.1-44.el8_9.1.aarch64 121/174 Running scriptlet: libmount-2.32.1-44.el8_9.1.aarch64 121/174 Installing : systemd-libs-239-78.el8.aarch64 122/174 Running scriptlet: systemd-libs-239-78.el8.aarch64 122/174 Installing : libnsl2-1.2.0-2.20180605git4a062cf.el8.aarch64 123/174 Running scriptlet: libnsl2-1.2.0-2.20180605git4a062cf.el8.aarch64 123/174 Installing : platform-python-setuptools-39.2.0-7.el8.noarch 124/174 Installing : platform-python-3.6.8-56.el8_9.3.aarch64 125/174 Running scriptlet: platform-python-3.6.8-56.el8_9.3.aarch64 125/174 Installing : python3-libs-3.6.8-56.el8_9.3.aarch64 126/174 Installing : gzip-1.9-13.el8_5.aarch64 127/174 Running scriptlet: gzip-1.9-13.el8_5.aarch64 127/174 Installing : cracklib-2.9.6-15.el8.aarch64 128/174 Installing : cracklib-dicts-2.9.6-15.el8.aarch64 129/174 Installing : binutils-2.30-123.el8.aarch64 130/174 Running scriptlet: binutils-2.30-123.el8.aarch64 130/174 Installing : shadow-utils-2:4.6-19.el8.aarch64 131/174 Running scriptlet: libutempter-1.1.6-14.el8.aarch64 132/174 Installing : libutempter-1.1.6-14.el8.aarch64 132/174 Running scriptlet: tpm2-tss-2.3.2-5.el8.aarch64 133/174 Installing : tpm2-tss-2.3.2-5.el8.aarch64 133/174 Running scriptlet: tpm2-tss-2.3.2-5.el8.aarch64 133/174 Installing : ima-evm-utils-1.3.2-12.el8.aarch64 134/174 Installing : 
libpwquality-1.4.4-6.el8.aarch64 135/174 Installing : pam-1.3.1-27.el8.aarch64 136/174 Running scriptlet: pam-1.3.1-27.el8.aarch64 136/174 Installing : libusbx-1.0.23-4.el8.aarch64 137/174 Installing : glib2-2.56.4-161.el8.aarch64 138/174 Installing : libbabeltrace-1.5.4-4.el8.aarch64 139/174 Running scriptlet: libbabeltrace-1.5.4-4.el8.aarch64 139/174 Installing : libfdisk-2.32.1-44.el8_9.1.aarch64 140/174 Running scriptlet: libfdisk-2.32.1-44.el8_9.1.aarch64 140/174 Installing : cyrus-sasl-lib-2.1.27-6.el8_5.aarch64 141/174 Running scriptlet: cyrus-sasl-lib-2.1.27-6.el8_5.aarch64 141/174 Installing : openldap-2.4.46-18.el8.aarch64 142/174 Installing : gnupg2-2.2.20-3.el8_6.aarch64 143/174 Installing : libssh-0.9.6-13.el8_9.aarch64 144/174 Installing : libdb-utils-5.3.28-42.el8_4.aarch64 145/174 Installing : libarchive-3.3.3-5.el8.aarch64 146/174 Installing : libsmartcols-2.32.1-44.el8_9.1.aarch64 147/174 Running scriptlet: libsmartcols-2.32.1-44.el8_9.1.aarch64 147/174 Installing : libatomic_ops-7.6.2-3.el8.aarch64 148/174 Installing : gc-7.6.4-3.el8.aarch64 149/174 Installing : guile-5:2.0.14-7.el8.aarch64 150/174 Running scriptlet: guile-5:2.0.14-7.el8.aarch64 150/174 Installing : publicsuffix-list-dafsa-20180723-1.el8.noarch 151/174 Installing : libpsl-0.20.2-6.el8.aarch64 152/174 Installing : libcurl-7.61.1-33.el8_9.5.aarch64 153/174 Installing : curl-7.61.1-33.el8_9.5.aarch64 154/174 Installing : rpm-4.14.3-28.el8_9.aarch64 155/174 Installing : rpm-libs-4.14.3-28.el8_9.aarch64 156/174 Running scriptlet: rpm-libs-4.14.3-28.el8_9.aarch64 156/174 Installing : rpm-build-libs-4.14.3-28.el8_9.aarch64 157/174 Running scriptlet: rpm-build-libs-4.14.3-28.el8_9.aarch64 157/174 Installing : gdb-headless-8.2-20.el8.aarch64 158/174 Installing : efi-srpm-macros-3-3.el8.noarch 159/174 Installing : lua-srpm-macros-1-13.el8.noarch 160/174 Installing : pkgconf-m4-1.4.2-1.el8.noarch 161/174 Installing : pkgconf-pkg-config-1.4.2-1.el8.aarch64 162/174 Installing : 
glibc-devel-2.28-236.el8_9.12.aarch64 163/174 Running scriptlet: glibc-devel-2.28-236.el8_9.12.aarch64 163/174 Installing : libxcrypt-devel-4.1.1-6.el8.aarch64 164/174 Installing : gcc-8.5.0-20.el8.aarch64 165/174 Running scriptlet: gcc-8.5.0-20.el8.aarch64 165/174 Installing : annobin-11.13-2.el8.aarch64 166/174 Installing : gcc-plugin-annobin-8.5.0-20.el8.aarch64 167/174 Installing : redhat-rpm-config-131-1.el8.noarch 168/174 Running scriptlet: redhat-rpm-config-131-1.el8.noarch 168/174 Installing : rpm-build-4.14.3-28.el8_9.aarch64 169/174 Installing : gcc-c++-8.5.0-20.el8.aarch64 170/174 Installing : epel-rpm-macros-8-41.noarch 171/174 Installing : util-linux-2.32.1-44.el8_9.1.aarch64 172/174 Running scriptlet: util-linux-2.32.1-44.el8_9.1.aarch64 172/174 Installing : which-2.21-20.el8.aarch64 173/174 Installing : make-1:4.2.1-11.el8.aarch64 174/174 Running scriptlet: make-1:4.2.1-11.el8.aarch64 174/174 Running scriptlet: filesystem-3.8-6.el8.aarch64 174/174 Running scriptlet: glibc-all-langpacks-2.28-236.el8_9.12.aarch64 174/174 Running scriptlet: ca-certificates-2023.2.60_v7.0.306-80.0.el8_8.no 174/174 Running scriptlet: guile-5:2.0.14-7.el8.aarch64 174/174 Running scriptlet: glibc-common-2.28-236.el8_9.12.aarch64 174/174 Running scriptlet: info-6.5-7.el8.aarch64 174/174 Running scriptlet: glib2-2.56.4-161.el8.aarch64 174/174 Verifying : bzip2-1.0.6-26.el8.aarch64 1/174 Verifying : bzip2-libs-1.0.6-26.el8.aarch64 2/174 Verifying : cracklib-2.9.6-15.el8.aarch64 3/174 Verifying : cracklib-dicts-2.9.6-15.el8.aarch64 4/174 Verifying : grep-3.1-6.el8.aarch64 5/174 Verifying : libacl-2.2.53-1.el8.aarch64 6/174 Verifying : libassuan-2.5.1-3.el8.aarch64 7/174 Verifying : libattr-2.4.48-3.el8.aarch64 8/174 Verifying : libgpg-error-1.31-1.el8.aarch64 9/174 Verifying : libnsl2-1.2.0-2.20180605git4a062cf.el8.aarch64 10/174 Verifying : libpkgconf-1.4.2-1.el8.aarch64 11/174 Verifying : libsigsegv-2.11-5.el8.aarch64 12/174 Verifying : libtool-ltdl-2.4.6-25.el8.aarch64 
13/174 Verifying : libunistring-0.9.9-3.el8.aarch64 14/174 Verifying : libutempter-1.1.6-14.el8.aarch64 15/174 Verifying : mpfr-3.1.6-1.el8.aarch64 16/174 Verifying : npth-1.5-4.el8.aarch64 17/174 Verifying : pkgconf-1.4.2-1.el8.aarch64 18/174 Verifying : pkgconf-pkg-config-1.4.2-1.el8.aarch64 19/174 Verifying : readline-7.0-10.el8.aarch64 20/174 Verifying : zip-3.0-23.el8.aarch64 21/174 Verifying : basesystem-11-5.el8.noarch 22/174 Verifying : pkgconf-m4-1.4.2-1.el8.noarch 23/174 Verifying : publicsuffix-list-dafsa-20180723-1.el8.noarch 24/174 Verifying : gmp-1:6.1.2-10.el8.aarch64 25/174 Verifying : libidn2-2.2.0-1.el8.aarch64 26/174 Verifying : diffutils-3.6-6.el8.aarch64 27/174 Verifying : patch-2.7.6-11.el8.aarch64 28/174 Verifying : libpsl-0.20.2-6.el8.aarch64 29/174 Verifying : libusbx-1.0.23-4.el8.aarch64 30/174 Verifying : libzstd-1.4.4-1.el8.aarch64 31/174 Verifying : brotli-1.0.6-3.el8.aarch64 32/174 Verifying : ima-evm-utils-1.3.2-12.el8.aarch64 33/174 Verifying : p11-kit-trust-0.23.22-1.el8.aarch64 34/174 Verifying : popt-1.18-1.el8.aarch64 35/174 Verifying : libdb-5.3.28-42.el8_4.aarch64 36/174 Verifying : libdb-utils-5.3.28-42.el8_4.aarch64 37/174 Verifying : libsepol-2.9-3.el8.aarch64 38/174 Verifying : lz4-libs-1.8.3-3.el8_4.aarch64 39/174 Verifying : nettle-3.4.1-7.el8.aarch64 40/174 Verifying : openldap-2.4.46-18.el8.aarch64 41/174 Verifying : p11-kit-0.23.22-1.el8.aarch64 42/174 Verifying : pcre-8.42-6.el8.aarch64 43/174 Verifying : cyrus-sasl-lib-2.1.27-6.el8_5.aarch64 44/174 Verifying : filesystem-3.8-6.el8.aarch64 45/174 Verifying : gzip-1.9-13.el8_5.aarch64 46/174 Verifying : keyutils-libs-1.5.10-9.el8.aarch64 47/174 Verifying : libcap-ng-0.7.11-1.el8.aarch64 48/174 Verifying : libxcrypt-4.1.1-6.el8.aarch64 49/174 Verifying : libxcrypt-devel-4.1.1-6.el8.aarch64 50/174 Verifying : lua-libs-5.3.4-12.el8.aarch64 51/174 Verifying : make-1:4.2.1-11.el8.aarch64 52/174 Verifying : cpio-2.12-11.el8.aarch64 53/174 Verifying : gawk-4.2.1-4.el8.aarch64 
54/174 Verifying : info-6.5-7.el8.aarch64 55/174 Verifying : sed-4.5-5.el8.aarch64 56/174 Verifying : unzip-6.0-46.el8.aarch64 57/174 Verifying : xz-5.2.4-4.el8_6.aarch64 58/174 Verifying : xz-libs-5.2.4-4.el8_6.aarch64 59/174 Verifying : bash-4.4.20-4.el8_6.aarch64 60/174 Verifying : gdbm-1:1.18-2.el8.aarch64 61/174 Verifying : gdbm-libs-1:1.18-2.el8.aarch64 62/174 Verifying : gnupg2-2.2.20-3.el8_6.aarch64 63/174 Verifying : libbabeltrace-1.5.4-4.el8.aarch64 64/174 Verifying : libcom_err-1.45.6-5.el8.aarch64 65/174 Verifying : libgcrypt-1.8.5-7.el8_6.aarch64 66/174 Verifying : libsemanage-2.9-9.el8_6.aarch64 67/174 Verifying : libtirpc-1.1.4-8.el8.aarch64 68/174 Verifying : libverto-0.3.2-2.el8.aarch64 69/174 Verifying : pcre2-10.32-3.el8_6.aarch64 70/174 Verifying : coreutils-8.30-15.el8.aarch64 71/174 Verifying : coreutils-common-8.30-15.el8.aarch64 72/174 Verifying : glib2-2.56.4-161.el8.aarch64 73/174 Verifying : libffi-3.1-24.el8.aarch64 74/174 Verifying : libksba-1.3.5-9.el8_7.aarch64 75/174 Verifying : libpwquality-1.4.4-6.el8.aarch64 76/174 Verifying : libselinux-2.9-8.el8.aarch64 77/174 Verifying : libtasn1-4.13-4.el8_7.aarch64 78/174 Verifying : platform-python-setuptools-39.2.0-7.el8.noarch 79/174 Verifying : python3-setuptools-wheel-39.2.0-7.el8.noarch 80/174 Verifying : setup-2.12.2-9.el8.noarch 81/174 Verifying : tar-2:1.30-9.el8.aarch64 82/174 Verifying : audit-libs-3.0.7-5.el8.aarch64 83/174 Verifying : ca-certificates-2023.2.60_v7.0.306-80.0.el8_8.no 84/174 Verifying : chkconfig-1.19.2-1.el8.aarch64 85/174 Verifying : crypto-policies-20230731-1.git3177e06.el8.noarch 86/174 Verifying : elfutils-0.189-3.el8.aarch64 87/174 Verifying : elfutils-libs-0.189-3.el8.aarch64 88/174 Verifying : file-5.33-25.el8.aarch64 89/174 Verifying : file-libs-5.33-25.el8.aarch64 90/174 Verifying : libarchive-3.3.3-5.el8.aarch64 91/174 Verifying : libgomp-8.5.0-20.el8.aarch64 92/174 Verifying : libnghttp2-1.33.0-5.el8_9.aarch64 93/174 Verifying : 
ncurses-6.1-10.20180224.el8.aarch64 94/174 Verifying : ncurses-libs-6.1-10.20180224.el8.aarch64 95/174 Verifying : tpm2-tss-2.3.2-5.el8.aarch64 96/174 Verifying : which-2.21-20.el8.aarch64 97/174 Verifying : zlib-1.2.11-25.el8.aarch64 98/174 Verifying : binutils-2.30-123.el8.aarch64 99/174 Verifying : elfutils-default-yama-scope-0.189-3.el8.noarch 100/174 Verifying : elfutils-libelf-0.189-3.el8.aarch64 101/174 Verifying : findutils-1:4.6.0-21.el8.aarch64 102/174 Verifying : krb5-libs-1.18.2-26.el8_9.aarch64 103/174 Verifying : libasan-8.5.0-20.el8.aarch64 104/174 Verifying : libatomic-8.5.0-20.el8.aarch64 105/174 Verifying : libcap-2.48-6.el8_9.aarch64 106/174 Verifying : libgcc-8.5.0-20.el8.aarch64 107/174 Verifying : libstdc++-8.5.0-20.el8.aarch64 108/174 Verifying : libubsan-8.5.0-20.el8.aarch64 109/174 Verifying : libxml2-2.9.7-18.el8_9.aarch64 110/174 Verifying : ncurses-base-6.1-10.20180224.el8.noarch 111/174 Verifying : openssl-libs-1:1.1.1k-12.el8_9.aarch64 112/174 Verifying : pam-1.3.1-27.el8.aarch64 113/174 Verifying : platform-python-3.6.8-56.el8_9.3.aarch64 114/174 Verifying : python3-libs-3.6.8-56.el8_9.3.aarch64 115/174 Verifying : redhat-release-8.9-0.1.el8.aarch64 116/174 Verifying : shadow-utils-2:4.6-19.el8.aarch64 117/174 Verifying : sqlite-libs-3.26.0-19.el8_9.aarch64 118/174 Verifying : systemd-libs-239-78.el8.aarch64 119/174 Verifying : libssh-0.9.6-13.el8_9.aarch64 120/174 Verifying : libssh-config-0.9.6-13.el8_9.noarch 121/174 Verifying : rpm-4.14.3-28.el8_9.aarch64 122/174 Verifying : rpm-build-libs-4.14.3-28.el8_9.aarch64 123/174 Verifying : rpm-libs-4.14.3-28.el8_9.aarch64 124/174 Verifying : tzdata-2024a-1.el8.noarch 125/174 Verifying : glibc-2.28-236.el8_9.12.aarch64 126/174 Verifying : glibc-all-langpacks-2.28-236.el8_9.12.aarch64 127/174 Verifying : glibc-common-2.28-236.el8_9.12.aarch64 128/174 Verifying : glibc-devel-2.28-236.el8_9.12.aarch64 129/174 Verifying : glibc-gconv-extra-2.28-236.el8_9.12.aarch64 130/174 Verifying : 
glibc-headers-2.28-236.el8_9.12.aarch64 131/174 Verifying : curl-7.61.1-33.el8_9.5.aarch64 132/174 Verifying : kernel-headers-4.18.0-513.24.1.el8_9.aarch64 133/174 Verifying : libblkid-2.32.1-44.el8_9.1.aarch64 134/174 Verifying : libcurl-7.61.1-33.el8_9.5.aarch64 135/174 Verifying : libfdisk-2.32.1-44.el8_9.1.aarch64 136/174 Verifying : libmount-2.32.1-44.el8_9.1.aarch64 137/174 Verifying : libsmartcols-2.32.1-44.el8_9.1.aarch64 138/174 Verifying : libuuid-2.32.1-44.el8_9.1.aarch64 139/174 Verifying : python3-pip-wheel-9.0.3-23.el8_9.1.noarch 140/174 Verifying : util-linux-2.32.1-44.el8_9.1.aarch64 141/174 Verifying : expat-2.2.5-11.el8_9.1.aarch64 142/174 Verifying : gnutls-3.6.16-8.el8_9.3.aarch64 143/174 Verifying : guile-5:2.0.14-7.el8.aarch64 144/174 Verifying : isl-0.16.1-6.el8.aarch64 145/174 Verifying : libatomic_ops-7.6.2-3.el8.aarch64 146/174 Verifying : gc-7.6.4-3.el8.aarch64 147/174 Verifying : rust-srpm-macros-5-2.el8.noarch 148/174 Verifying : ghc-srpm-macros-1.4.2-7.el8.noarch 149/174 Verifying : ocaml-srpm-macros-5-4.el8.noarch 150/174 Verifying : openblas-srpm-macros-2-2.el8.noarch 151/174 Verifying : perl-srpm-macros-1-25.el8.noarch 152/174 Verifying : zstd-1.4.4-1.el8.aarch64 153/174 Verifying : efi-srpm-macros-3-3.el8.noarch 154/174 Verifying : libmpc-1.1.0-9.1.el8.aarch64 155/174 Verifying : go-srpm-macros-2-17.el8.noarch 156/174 Verifying : dwz-0.12-10.el8.aarch64 157/174 Verifying : qt5-srpm-macros-5.15.3-1.el8.noarch 158/174 Verifying : python-rpm-macros-3-45.el8.noarch 159/174 Verifying : redhat-rpm-config-131-1.el8.noarch 160/174 Verifying : python-srpm-macros-3-45.el8.noarch 161/174 Verifying : python3-rpm-macros-3-45.el8.noarch 162/174 Verifying : annobin-11.13-2.el8.aarch64 163/174 Verifying : cpp-8.5.0-20.el8.aarch64 164/174 Verifying : gcc-8.5.0-20.el8.aarch64 165/174 Verifying : gcc-plugin-annobin-8.5.0-20.el8.aarch64 166/174 Verifying : libstdc++-devel-8.5.0-20.el8.aarch64 167/174 Verifying : gcc-c++-8.5.0-20.el8.aarch64 168/174 
Verifying : gdb-headless-8.2-20.el8.aarch64 169/174 Verifying : rpm-build-4.14.3-28.el8_9.aarch64 170/174 Verifying : ansible-srpm-macros-1-12.el8.noarch 171/174 Verifying : epel-rpm-macros-8-41.noarch 172/174 Verifying : fpc-srpm-macros-1.3-1.el8.noarch 173/174 Verifying : lua-srpm-macros-1-13.el8.noarch 174/174 Installed products updated. Installed: annobin-11.13-2.el8.aarch64 ansible-srpm-macros-1-12.el8.noarch audit-libs-3.0.7-5.el8.aarch64 basesystem-11-5.el8.noarch bash-4.4.20-4.el8_6.aarch64 binutils-2.30-123.el8.aarch64 brotli-1.0.6-3.el8.aarch64 bzip2-1.0.6-26.el8.aarch64 bzip2-libs-1.0.6-26.el8.aarch64 ca-certificates-2023.2.60_v7.0.306-80.0.el8_8.noarch chkconfig-1.19.2-1.el8.aarch64 coreutils-8.30-15.el8.aarch64 coreutils-common-8.30-15.el8.aarch64 cpio-2.12-11.el8.aarch64 cpp-8.5.0-20.el8.aarch64 cracklib-2.9.6-15.el8.aarch64 cracklib-dicts-2.9.6-15.el8.aarch64 crypto-policies-20230731-1.git3177e06.el8.noarch curl-7.61.1-33.el8_9.5.aarch64 cyrus-sasl-lib-2.1.27-6.el8_5.aarch64 diffutils-3.6-6.el8.aarch64 dwz-0.12-10.el8.aarch64 efi-srpm-macros-3-3.el8.noarch elfutils-0.189-3.el8.aarch64 elfutils-default-yama-scope-0.189-3.el8.noarch elfutils-libelf-0.189-3.el8.aarch64 elfutils-libs-0.189-3.el8.aarch64 epel-rpm-macros-8-41.noarch expat-2.2.5-11.el8_9.1.aarch64 file-5.33-25.el8.aarch64 file-libs-5.33-25.el8.aarch64 filesystem-3.8-6.el8.aarch64 findutils-1:4.6.0-21.el8.aarch64 fpc-srpm-macros-1.3-1.el8.noarch gawk-4.2.1-4.el8.aarch64 gc-7.6.4-3.el8.aarch64 gcc-8.5.0-20.el8.aarch64 gcc-c++-8.5.0-20.el8.aarch64 gcc-plugin-annobin-8.5.0-20.el8.aarch64 gdb-headless-8.2-20.el8.aarch64 gdbm-1:1.18-2.el8.aarch64 gdbm-libs-1:1.18-2.el8.aarch64 ghc-srpm-macros-1.4.2-7.el8.noarch glib2-2.56.4-161.el8.aarch64 glibc-2.28-236.el8_9.12.aarch64 glibc-all-langpacks-2.28-236.el8_9.12.aarch64 glibc-common-2.28-236.el8_9.12.aarch64 glibc-devel-2.28-236.el8_9.12.aarch64 glibc-gconv-extra-2.28-236.el8_9.12.aarch64 glibc-headers-2.28-236.el8_9.12.aarch64 
gmp-1:6.1.2-10.el8.aarch64 gnupg2-2.2.20-3.el8_6.aarch64 gnutls-3.6.16-8.el8_9.3.aarch64 go-srpm-macros-2-17.el8.noarch grep-3.1-6.el8.aarch64 guile-5:2.0.14-7.el8.aarch64 gzip-1.9-13.el8_5.aarch64 ima-evm-utils-1.3.2-12.el8.aarch64 info-6.5-7.el8.aarch64 isl-0.16.1-6.el8.aarch64 kernel-headers-4.18.0-513.24.1.el8_9.aarch64 keyutils-libs-1.5.10-9.el8.aarch64 krb5-libs-1.18.2-26.el8_9.aarch64 libacl-2.2.53-1.el8.aarch64 libarchive-3.3.3-5.el8.aarch64 libasan-8.5.0-20.el8.aarch64 libassuan-2.5.1-3.el8.aarch64 libatomic-8.5.0-20.el8.aarch64 libatomic_ops-7.6.2-3.el8.aarch64 libattr-2.4.48-3.el8.aarch64 libbabeltrace-1.5.4-4.el8.aarch64 libblkid-2.32.1-44.el8_9.1.aarch64 libcap-2.48-6.el8_9.aarch64 libcap-ng-0.7.11-1.el8.aarch64 libcom_err-1.45.6-5.el8.aarch64 libcurl-7.61.1-33.el8_9.5.aarch64 libdb-5.3.28-42.el8_4.aarch64 libdb-utils-5.3.28-42.el8_4.aarch64 libfdisk-2.32.1-44.el8_9.1.aarch64 libffi-3.1-24.el8.aarch64 libgcc-8.5.0-20.el8.aarch64 libgcrypt-1.8.5-7.el8_6.aarch64 libgomp-8.5.0-20.el8.aarch64 libgpg-error-1.31-1.el8.aarch64 libidn2-2.2.0-1.el8.aarch64 libksba-1.3.5-9.el8_7.aarch64 libmount-2.32.1-44.el8_9.1.aarch64 libmpc-1.1.0-9.1.el8.aarch64 libnghttp2-1.33.0-5.el8_9.aarch64 libnsl2-1.2.0-2.20180605git4a062cf.el8.aarch64 libpkgconf-1.4.2-1.el8.aarch64 libpsl-0.20.2-6.el8.aarch64 libpwquality-1.4.4-6.el8.aarch64 libselinux-2.9-8.el8.aarch64 libsemanage-2.9-9.el8_6.aarch64 libsepol-2.9-3.el8.aarch64 libsigsegv-2.11-5.el8.aarch64 libsmartcols-2.32.1-44.el8_9.1.aarch64 libssh-0.9.6-13.el8_9.aarch64 libssh-config-0.9.6-13.el8_9.noarch libstdc++-8.5.0-20.el8.aarch64 libstdc++-devel-8.5.0-20.el8.aarch64 libtasn1-4.13-4.el8_7.aarch64 libtirpc-1.1.4-8.el8.aarch64 libtool-ltdl-2.4.6-25.el8.aarch64 libubsan-8.5.0-20.el8.aarch64 libunistring-0.9.9-3.el8.aarch64 libusbx-1.0.23-4.el8.aarch64 libutempter-1.1.6-14.el8.aarch64 libuuid-2.32.1-44.el8_9.1.aarch64 libverto-0.3.2-2.el8.aarch64 libxcrypt-4.1.1-6.el8.aarch64 libxcrypt-devel-4.1.1-6.el8.aarch64 
libxml2-2.9.7-18.el8_9.aarch64 libzstd-1.4.4-1.el8.aarch64 lua-libs-5.3.4-12.el8.aarch64 lua-srpm-macros-1-13.el8.noarch lz4-libs-1.8.3-3.el8_4.aarch64 make-1:4.2.1-11.el8.aarch64 mpfr-3.1.6-1.el8.aarch64 ncurses-6.1-10.20180224.el8.aarch64 ncurses-base-6.1-10.20180224.el8.noarch ncurses-libs-6.1-10.20180224.el8.aarch64 nettle-3.4.1-7.el8.aarch64 npth-1.5-4.el8.aarch64 ocaml-srpm-macros-5-4.el8.noarch openblas-srpm-macros-2-2.el8.noarch openldap-2.4.46-18.el8.aarch64 openssl-libs-1:1.1.1k-12.el8_9.aarch64 p11-kit-0.23.22-1.el8.aarch64 p11-kit-trust-0.23.22-1.el8.aarch64 pam-1.3.1-27.el8.aarch64 patch-2.7.6-11.el8.aarch64 pcre-8.42-6.el8.aarch64 pcre2-10.32-3.el8_6.aarch64 perl-srpm-macros-1-25.el8.noarch pkgconf-1.4.2-1.el8.aarch64 pkgconf-m4-1.4.2-1.el8.noarch pkgconf-pkg-config-1.4.2-1.el8.aarch64 platform-python-3.6.8-56.el8_9.3.aarch64 platform-python-setuptools-39.2.0-7.el8.noarch popt-1.18-1.el8.aarch64 publicsuffix-list-dafsa-20180723-1.el8.noarch python-rpm-macros-3-45.el8.noarch python-srpm-macros-3-45.el8.noarch python3-libs-3.6.8-56.el8_9.3.aarch64 python3-pip-wheel-9.0.3-23.el8_9.1.noarch python3-rpm-macros-3-45.el8.noarch python3-setuptools-wheel-39.2.0-7.el8.noarch qt5-srpm-macros-5.15.3-1.el8.noarch readline-7.0-10.el8.aarch64 redhat-release-8.9-0.1.el8.aarch64 redhat-rpm-config-131-1.el8.noarch rpm-4.14.3-28.el8_9.aarch64 rpm-build-4.14.3-28.el8_9.aarch64 rpm-build-libs-4.14.3-28.el8_9.aarch64 rpm-libs-4.14.3-28.el8_9.aarch64 rust-srpm-macros-5-2.el8.noarch sed-4.5-5.el8.aarch64 setup-2.12.2-9.el8.noarch shadow-utils-2:4.6-19.el8.aarch64 sqlite-libs-3.26.0-19.el8_9.aarch64 systemd-libs-239-78.el8.aarch64 tar-2:1.30-9.el8.aarch64 tpm2-tss-2.3.2-5.el8.aarch64 tzdata-2024a-1.el8.noarch unzip-6.0-46.el8.aarch64 util-linux-2.32.1-44.el8_9.1.aarch64 which-2.21-20.el8.aarch64 xz-5.2.4-4.el8_6.aarch64 xz-libs-5.2.4-4.el8_6.aarch64 zip-3.0-23.el8.aarch64 zlib-1.2.11-25.el8.aarch64 zstd-1.4.4-1.el8.aarch64 Complete! 
Finish: installing minimal buildroot with dnf Start: creating root cache Finish: creating root cache Finish: chroot init INFO: Installed packages: INFO: annobin-11.13-2.el8.aarch64 ansible-srpm-macros-1-12.el8.noarch audit-libs-3.0.7-5.el8.aarch64 basesystem-11-5.el8.noarch bash-4.4.20-4.el8_6.aarch64 binutils-2.30-123.el8.aarch64 brotli-1.0.6-3.el8.aarch64 bzip2-1.0.6-26.el8.aarch64 bzip2-libs-1.0.6-26.el8.aarch64 ca-certificates-2023.2.60_v7.0.306-80.0.el8_8.noarch chkconfig-1.19.2-1.el8.aarch64 coreutils-8.30-15.el8.aarch64 coreutils-common-8.30-15.el8.aarch64 cpio-2.12-11.el8.aarch64 cpp-8.5.0-20.el8.aarch64 cracklib-2.9.6-15.el8.aarch64 cracklib-dicts-2.9.6-15.el8.aarch64 crypto-policies-20230731-1.git3177e06.el8.noarch curl-7.61.1-33.el8_9.5.aarch64 cyrus-sasl-lib-2.1.27-6.el8_5.aarch64 diffutils-3.6-6.el8.aarch64 dwz-0.12-10.el8.aarch64 efi-srpm-macros-3-3.el8.noarch elfutils-0.189-3.el8.aarch64 elfutils-default-yama-scope-0.189-3.el8.noarch elfutils-libelf-0.189-3.el8.aarch64 elfutils-libs-0.189-3.el8.aarch64 epel-rpm-macros-8-41.noarch expat-2.2.5-11.el8_9.1.aarch64 file-5.33-25.el8.aarch64 file-libs-5.33-25.el8.aarch64 filesystem-3.8-6.el8.aarch64 findutils-4.6.0-21.el8.aarch64 fpc-srpm-macros-1.3-1.el8.noarch gawk-4.2.1-4.el8.aarch64 gc-7.6.4-3.el8.aarch64 gcc-8.5.0-20.el8.aarch64 gcc-c++-8.5.0-20.el8.aarch64 gcc-plugin-annobin-8.5.0-20.el8.aarch64 gdb-headless-8.2-20.el8.aarch64 gdbm-1.18-2.el8.aarch64 gdbm-libs-1.18-2.el8.aarch64 ghc-srpm-macros-1.4.2-7.el8.noarch glib2-2.56.4-161.el8.aarch64 glibc-2.28-236.el8_9.12.aarch64 glibc-all-langpacks-2.28-236.el8_9.12.aarch64 glibc-common-2.28-236.el8_9.12.aarch64 glibc-devel-2.28-236.el8_9.12.aarch64 glibc-gconv-extra-2.28-236.el8_9.12.aarch64 glibc-headers-2.28-236.el8_9.12.aarch64 gmp-6.1.2-10.el8.aarch64 gnupg2-2.2.20-3.el8_6.aarch64 gnutls-3.6.16-8.el8_9.3.aarch64 go-srpm-macros-2-17.el8.noarch gpg-pubkey-2f86d6a1-5cf7cefb gpg-pubkey-2fa658e0-45700c69 gpg-pubkey-fd431d51-4ae0493b grep-3.1-6.el8.aarch64 
guile-2.0.14-7.el8.aarch64 gzip-1.9-13.el8_5.aarch64 ima-evm-utils-1.3.2-12.el8.aarch64 info-6.5-7.el8.aarch64 isl-0.16.1-6.el8.aarch64 kernel-headers-4.18.0-513.24.1.el8_9.aarch64 keyutils-libs-1.5.10-9.el8.aarch64 krb5-libs-1.18.2-26.el8_9.aarch64 libacl-2.2.53-1.el8.aarch64 libarchive-3.3.3-5.el8.aarch64 libasan-8.5.0-20.el8.aarch64 libassuan-2.5.1-3.el8.aarch64 libatomic-8.5.0-20.el8.aarch64 libatomic_ops-7.6.2-3.el8.aarch64 libattr-2.4.48-3.el8.aarch64 libbabeltrace-1.5.4-4.el8.aarch64 libblkid-2.32.1-44.el8_9.1.aarch64 libcap-2.48-6.el8_9.aarch64 libcap-ng-0.7.11-1.el8.aarch64 libcom_err-1.45.6-5.el8.aarch64 libcurl-7.61.1-33.el8_9.5.aarch64 libdb-5.3.28-42.el8_4.aarch64 libdb-utils-5.3.28-42.el8_4.aarch64 libfdisk-2.32.1-44.el8_9.1.aarch64 libffi-3.1-24.el8.aarch64 libgcc-8.5.0-20.el8.aarch64 libgcrypt-1.8.5-7.el8_6.aarch64 libgomp-8.5.0-20.el8.aarch64 libgpg-error-1.31-1.el8.aarch64 libidn2-2.2.0-1.el8.aarch64 libksba-1.3.5-9.el8_7.aarch64 libmount-2.32.1-44.el8_9.1.aarch64 libmpc-1.1.0-9.1.el8.aarch64 libnghttp2-1.33.0-5.el8_9.aarch64 libnsl2-1.2.0-2.20180605git4a062cf.el8.aarch64 libpkgconf-1.4.2-1.el8.aarch64 libpsl-0.20.2-6.el8.aarch64 libpwquality-1.4.4-6.el8.aarch64 libselinux-2.9-8.el8.aarch64 libsemanage-2.9-9.el8_6.aarch64 libsepol-2.9-3.el8.aarch64 libsigsegv-2.11-5.el8.aarch64 libsmartcols-2.32.1-44.el8_9.1.aarch64 libssh-0.9.6-13.el8_9.aarch64 libssh-config-0.9.6-13.el8_9.noarch libstdc++-8.5.0-20.el8.aarch64 libstdc++-devel-8.5.0-20.el8.aarch64 libtasn1-4.13-4.el8_7.aarch64 libtirpc-1.1.4-8.el8.aarch64 libtool-ltdl-2.4.6-25.el8.aarch64 libubsan-8.5.0-20.el8.aarch64 libunistring-0.9.9-3.el8.aarch64 libusbx-1.0.23-4.el8.aarch64 libutempter-1.1.6-14.el8.aarch64 libuuid-2.32.1-44.el8_9.1.aarch64 libverto-0.3.2-2.el8.aarch64 libxcrypt-4.1.1-6.el8.aarch64 libxcrypt-devel-4.1.1-6.el8.aarch64 libxml2-2.9.7-18.el8_9.aarch64 libzstd-1.4.4-1.el8.aarch64 lua-libs-5.3.4-12.el8.aarch64 lua-srpm-macros-1-13.el8.noarch lz4-libs-1.8.3-3.el8_4.aarch64 
make-4.2.1-11.el8.aarch64 mpfr-3.1.6-1.el8.aarch64 ncurses-6.1-10.20180224.el8.aarch64 ncurses-base-6.1-10.20180224.el8.noarch ncurses-libs-6.1-10.20180224.el8.aarch64 nettle-3.4.1-7.el8.aarch64 npth-1.5-4.el8.aarch64 ocaml-srpm-macros-5-4.el8.noarch openblas-srpm-macros-2-2.el8.noarch openldap-2.4.46-18.el8.aarch64 openssl-libs-1.1.1k-12.el8_9.aarch64 p11-kit-0.23.22-1.el8.aarch64 p11-kit-trust-0.23.22-1.el8.aarch64 pam-1.3.1-27.el8.aarch64 patch-2.7.6-11.el8.aarch64 pcre-8.42-6.el8.aarch64 pcre2-10.32-3.el8_6.aarch64 perl-srpm-macros-1-25.el8.noarch pkgconf-1.4.2-1.el8.aarch64 pkgconf-m4-1.4.2-1.el8.noarch pkgconf-pkg-config-1.4.2-1.el8.aarch64 platform-python-3.6.8-56.el8_9.3.aarch64 platform-python-setuptools-39.2.0-7.el8.noarch popt-1.18-1.el8.aarch64 publicsuffix-list-dafsa-20180723-1.el8.noarch python-rpm-macros-3-45.el8.noarch python-srpm-macros-3-45.el8.noarch python3-libs-3.6.8-56.el8_9.3.aarch64 python3-pip-wheel-9.0.3-23.el8_9.1.noarch python3-rpm-macros-3-45.el8.noarch python3-setuptools-wheel-39.2.0-7.el8.noarch qt5-srpm-macros-5.15.3-1.el8.noarch readline-7.0-10.el8.aarch64 redhat-release-8.9-0.1.el8.aarch64 redhat-rpm-config-131-1.el8.noarch rpm-4.14.3-28.el8_9.aarch64 rpm-build-4.14.3-28.el8_9.aarch64 rpm-build-libs-4.14.3-28.el8_9.aarch64 rpm-libs-4.14.3-28.el8_9.aarch64 rust-srpm-macros-5-2.el8.noarch sed-4.5-5.el8.aarch64 setup-2.12.2-9.el8.noarch shadow-utils-4.6-19.el8.aarch64 sqlite-libs-3.26.0-19.el8_9.aarch64 systemd-libs-239-78.el8.aarch64 tar-1.30-9.el8.aarch64 tpm2-tss-2.3.2-5.el8.aarch64 tzdata-2024a-1.el8.noarch unzip-6.0-46.el8.aarch64 util-linux-2.32.1-44.el8_9.1.aarch64 which-2.21-20.el8.aarch64 xz-5.2.4-4.el8_6.aarch64 xz-libs-5.2.4-4.el8_6.aarch64 zip-3.0-23.el8.aarch64 zlib-1.2.11-25.el8.aarch64 zstd-1.4.4-1.el8.aarch64 Start: buildsrpm Start: rpmbuild -bs sh: -c: line 0: unexpected EOF while looking for matching `"' sh: -c: line 1: syntax error: unexpected end of file Building target platforms: aarch64 Building for target 
aarch64
Wrote: /builddir/build/SRPMS/cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm
Finish: rpmbuild -bs
cp: preserving permissions for ‘/var/lib/copr-rpmbuild/results/chroot_scan/var/lib/mock/rhel+epel-8-aarch64-1713469169.948153/root/var/log’: No such file or directory
INFO: chroot_scan: 3 files copied to /var/lib/copr-rpmbuild/results/chroot_scan
INFO: /var/lib/mock/rhel+epel-8-aarch64-1713469169.948153/root/var/log/dnf.rpm.log
/var/lib/mock/rhel+epel-8-aarch64-1713469169.948153/root/var/log/dnf.librepo.log
/var/lib/mock/rhel+epel-8-aarch64-1713469169.948153/root/var/log/dnf.log
Finish: buildsrpm
INFO: Done(/var/lib/copr-rpmbuild/workspace/workdir-zcskrd06/cutlass/cutlass.spec) Config(child) 1 minutes 18 seconds
INFO: Results and/or logs in: /var/lib/copr-rpmbuild/results
INFO: Cleaning up build root ('cleanup_on_success=True')
Start: clean chroot
INFO: unmounting tmpfs.
Finish: clean chroot
INFO: Start(/var/lib/copr-rpmbuild/results/cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm) Config(rhel+epel-8-aarch64)
Start(bootstrap): chroot init
INFO: mounting tmpfs at /var/lib/mock/rhel+epel-8-aarch64-bootstrap-1713469169.948153/root.
INFO: reusing tmpfs at /var/lib/mock/rhel+epel-8-aarch64-bootstrap-1713469169.948153/root.
INFO: calling preinit hooks
INFO: enabled root cache
INFO: enabled package manager cache
Start(bootstrap): cleaning package manager metadata
Finish(bootstrap): cleaning package manager metadata
Finish(bootstrap): chroot init
Start: chroot init
INFO: mounting tmpfs at /var/lib/mock/rhel+epel-8-aarch64-1713469169.948153/root.
INFO: calling preinit hooks
INFO: enabled root cache
Start: unpacking root cache
Finish: unpacking root cache
INFO: enabled package manager cache
Start: cleaning package manager metadata
Finish: cleaning package manager metadata
INFO: enabled HW Info plugin
INFO: Buildroot is handled by package management downloaded with a bootstrap image:
  rpm-4.14.3-28.el8_9.aarch64
  python3-dnf-4.7.0-19.el8.noarch
  python3-dnf-plugins-core-4.0.21-23.el8.noarch
  yum-4.7.0-19.el8.noarch
Finish: chroot init
Start: build phase for cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm
Start: build setup for cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm
sh: -c: line 0: unexpected EOF while looking for matching `"'
sh: -c: line 1: syntax error: unexpected end of file
Building target platforms: aarch64
Building for target aarch64
Wrote: /builddir/build/SRPMS/cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm
No matches found for the following disable plugin patterns: local, spacewalk, versionlock
Updating Subscription Management repositories.
Unable to read consumer identity
This system is not registered with an entitlement server. You can use subscription-manager to register.
Copr repository                                  14 kB/s | 1.8 kB     00:00
Additional repo copr_rezso_CUDA                 125 kB/s | 1.8 kB     00:00
Additional repo http_developer_download_nvidia_ 708 kB/s | 3.5 kB     00:00
Additional repo http_developer_download_nvidia_ 518 kB/s | 3.5 kB     00:00
Additional repo http_developer_download_nvidia_ 784 kB/s | 3.5 kB     00:00
Red Hat Enterprise Linux - BaseOS                24 kB/s | 4.1 kB     00:00
Red Hat Enterprise Linux - AppStream             46 kB/s | 4.5 kB     00:00
Red Hat Enterprise Linux - CodeReady Linux Buil  54 kB/s | 4.5 kB     00:00
Extra Packages for Enterprise Linux 8 - aarch64 159 kB/s |  17 kB     00:00
Modular dependency problems:
 Problem 1: nothing provides requested module(nvidia-driver:latest-dkms:20240416083839)
 Problem 2: nothing provides requested module(nvidia-driver:latest-dkms:20240416084208)
Package gcc-c++-8.5.0-20.el8.aarch64 is already installed.
Dependencies resolved. ================================================================================================================================================================== Package Arch Version Repository Size ================================================================================================================================================================== Installing: cmake aarch64 3.26.5-1.el8_9 rhel-appstream 12 M cuda-cudart-devel-12-4 aarch64 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_sbsa 2.0 M cuda-driver-devel-12-4 aarch64 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_sbsa 43 k cuda-nvcc-12-4 aarch64 12.4.131-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_sbsa 64 M cuda-nvml-devel-12-4 aarch64 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_sbsa 226 k cuda-nvrtc-devel-12-4 aarch64 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_sbsa 26 M cuda-nvtx-12-4 aarch64 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_sbsa 89 k doxygen aarch64 1:1.8.14-12.el8 codeready-builder 3.6 M git aarch64 2.39.3-1.el8_8 rhel-appstream 104 k graphviz aarch64 2.40.1-44.el8 rhel-appstream 1.8 M libcublas-devel-12-4 aarch64 12.4.5.8-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_sbsa 388 M libcudnn8 aarch64 8.9.7.29-2.cuda12.3 copr_rezso_CUDA 466 M libcudnn8-devel aarch64 8.9.7.29-2.cuda12.3 copr_rezso_CUDA 35 k libcurand-devel-12-4 aarch64 10.3.5.147-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_sbsa 53 M python3-setuptools noarch 39.2.0-7.el8 rhel-baseos 163 k python36-devel aarch64 3.6.8-38.module+el8.9.0+20976+d3c38525 rhel-appstream 17 k Installing dependencies: adobe-mappings-cmap noarch 20171205-3.el8 rhel-appstream 2.1 M adobe-mappings-cmap-deprecated noarch 20171205-3.el8 rhel-appstream 119 k adobe-mappings-pdf noarch 20180407-1.el8 rhel-appstream 707 k atk aarch64 2.28.1-1.el8 
rhel-appstream 270 k avahi-libs aarch64 0.7-21.el8_9.1 rhel-baseos 60 k cairo aarch64 1.15.12-6.el8 rhel-appstream 672 k cmake-data noarch 3.26.5-1.el8_9 rhel-appstream 1.9 M cmake-filesystem aarch64 3.26.5-1.el8_9 rhel-appstream 45 k cmake-rpm-macros noarch 3.26.5-1.el8_9 rhel-appstream 44 k cuda-cccl-12-4 aarch64 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_sbsa 1.9 M cuda-crt-12-4 aarch64 12.4.131-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_sbsa 112 k cuda-cudart-12-4 aarch64 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_sbsa 235 k cuda-nvrtc-12-4 aarch64 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_sbsa 23 M cuda-nvvm-12-4 aarch64 12.4.131-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_sbsa 25 M cuda-toolkit-12-4-config-common noarch 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64 7.7 k cuda-toolkit-12-config-common noarch 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64 7.9 k cuda-toolkit-config-common noarch 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64 7.9 k cups-libs aarch64 1:2.2.6-54.el8_9 rhel-baseos 420 k dbus-libs aarch64 1:1.12.8-26.el8 rhel-baseos 177 k emacs-filesystem noarch 1:26.1-11.el8 rhel-baseos 70 k fontconfig aarch64 2.13.1-4.el8 rhel-baseos 272 k fontpackages-filesystem noarch 1.44-22.el8 rhel-baseos 16 k freetype aarch64 2.9.1-9.el8 rhel-baseos 370 k fribidi aarch64 1.0.4-9.el8 rhel-appstream 89 k gd aarch64 2.2.5-7.el8 rhel-appstream 134 k gdk-pixbuf2 aarch64 2.36.12-5.el8 rhel-baseos 463 k gdk-pixbuf2-modules aarch64 2.36.12-5.el8 rhel-appstream 106 k git-core aarch64 2.39.3-1.el8_8 rhel-appstream 10 M git-core-doc noarch 2.39.3-1.el8_8 rhel-appstream 3.0 M google-droid-sans-fonts noarch 20120715-13.el8 rhel-appstream 2.5 M graphite2 aarch64 1.3.10-10.el8 rhel-appstream 113 k groff-base aarch64 1.22.3-18.el8 rhel-baseos 994 k 
gtk-update-icon-cache aarch64 3.22.30-11.el8 rhel-appstream 32 k gtk2 aarch64 2.24.32-5.el8 rhel-appstream 3.3 M harfbuzz aarch64 1.7.5-3.el8 rhel-appstream 279 k hicolor-icon-theme noarch 0.17-2.el8 rhel-appstream 48 k jasper-libs aarch64 2.0.14-5.el8 rhel-appstream 158 k jbig2dec-libs aarch64 0.16-1.el8 rhel-appstream 68 k jbigkit-libs aarch64 2.1-14.el8 rhel-appstream 54 k lcms2 aarch64 2.9-2.el8 rhel-appstream 156 k less aarch64 530-2.el8_9 rhel-baseos 161 k libICE aarch64 1.0.9-15.el8 rhel-appstream 71 k libSM aarch64 1.2.3-1.el8 rhel-appstream 46 k libX11 aarch64 1.6.8-6.el8 rhel-appstream 589 k libX11-common noarch 1.6.8-6.el8 rhel-appstream 158 k libXau aarch64 1.0.9-3.el8 rhel-appstream 37 k libXaw aarch64 1.0.13-10.el8 rhel-appstream 185 k libXcomposite aarch64 0.4.4-14.el8 rhel-appstream 29 k libXcursor aarch64 1.1.15-3.el8 rhel-appstream 35 k libXdamage aarch64 1.1.4-14.el8 rhel-appstream 27 k libXext aarch64 1.3.4-1.el8 rhel-appstream 44 k libXfixes aarch64 5.0.3-7.el8 rhel-appstream 25 k libXft aarch64 2.3.3-1.el8 rhel-appstream 65 k libXi aarch64 1.7.10-1.el8 rhel-appstream 46 k libXinerama aarch64 1.1.4-1.el8 rhel-appstream 15 k libXmu aarch64 1.1.3-1.el8 rhel-appstream 73 k libXpm aarch64 3.5.12-9.el8_7 rhel-appstream 56 k libXrandr aarch64 1.5.2-1.el8 rhel-appstream 33 k libXrender aarch64 0.9.10-7.el8 rhel-appstream 31 k libXt aarch64 1.1.5-12.el8 rhel-appstream 174 k libXxf86misc aarch64 1.0.4-1.el8 rhel-appstream 23 k libXxf86vm aarch64 1.1.4-9.el8 rhel-appstream 19 k libcroco aarch64 0.6.12-4.el8_2.1 rhel-baseos 108 k libcublas-12-4 aarch64 12.4.5.8-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_sbsa 346 M libcurand-12-4 aarch64 10.3.5.147-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_sbsa 53 M libdatrie aarch64 0.2.9-7.el8 rhel-appstream 33 k libedit aarch64 3.1-23.20170329cvs.el8 rhel-baseos 99 k libfontenc aarch64 1.1.3-8.el8 rhel-appstream 36 k libgs aarch64 9.27-11.el8 rhel-appstream 2.9 M libidn aarch64 
1.34-5.el8 rhel-appstream 236 k libijs aarch64 0.35-5.el8 rhel-appstream 29 k libjpeg-turbo aarch64 1.5.3-12.el8 rhel-appstream 146 k libmcpp aarch64 2.7.2-20.el8 rhel-appstream 76 k libpaper aarch64 1.1.24-22.el8 rhel-appstream 44 k libpng aarch64 2:1.6.34-5.el8 rhel-baseos 119 k librsvg2 aarch64 2.42.7-5.el8 rhel-appstream 523 k libthai aarch64 0.1.27-2.el8 rhel-appstream 201 k libtiff aarch64 4.0.9-29.el8_8 rhel-appstream 180 k libuv aarch64 1:1.41.1-1.el8_4 rhel-appstream 150 k libwebp aarch64 1.0.0-9.el8_9.1 rhel-appstream 247 k libxcb aarch64 1.13.1-1.el8 rhel-appstream 223 k mcpp aarch64 2.7.2-20.el8 rhel-appstream 32 k openjpeg2 aarch64 2.4.0-5.el8 rhel-appstream 155 k openssh aarch64 8.0p1-19.el8_9.2 rhel-baseos 490 k openssh-clients aarch64 8.0p1-19.el8_9.2 rhel-baseos 627 k openssl aarch64 1:1.1.1k-12.el8_9 rhel-baseos 692 k pango aarch64 1.42.4-8.el8 rhel-appstream 285 k perl-Carp noarch 1.42-396.el8 rhel-baseos 30 k perl-Data-Dumper aarch64 2.167-399.el8 rhel-baseos 57 k perl-Digest noarch 1.17-395.el8 rhel-baseos 27 k perl-Digest-MD5 aarch64 2.55-396.el8 rhel-baseos 37 k perl-Encode aarch64 4:2.97-3.el8 rhel-baseos 1.5 M perl-Errno aarch64 1.28-422.el8 rhel-baseos 76 k perl-Error noarch 1:0.17025-2.el8 rhel-appstream 46 k perl-Exporter noarch 5.72-396.el8 rhel-baseos 34 k perl-File-Path noarch 2.15-2.el8 rhel-baseos 38 k perl-File-Temp noarch 0.230.600-1.el8 rhel-baseos 63 k perl-Getopt-Long noarch 1:2.50-4.el8 rhel-baseos 63 k perl-Git noarch 2.39.3-1.el8_8 rhel-appstream 79 k perl-HTTP-Tiny noarch 0.074-2.el8_9.1 rhel-baseos 59 k perl-IO aarch64 1.38-422.el8 rhel-baseos 142 k perl-IO-Socket-IP noarch 0.39-5.el8 rhel-baseos 47 k perl-IO-Socket-SSL noarch 2.066-4.module+el8.3.0+6446+594cad75 rhel-appstream 298 k perl-MIME-Base64 aarch64 3.15-396.el8 rhel-baseos 31 k perl-Mozilla-CA noarch 20160104-7.module+el8.3.0+6498+9eecfe51 rhel-appstream 15 k perl-Net-SSLeay aarch64 1.88-2.module+el8.6.0+13392+f0897f98 rhel-appstream 373 k perl-PathTools aarch64 
3.74-1.el8 rhel-baseos 90 k perl-Pod-Escapes noarch 1:1.07-395.el8 rhel-baseos 20 k perl-Pod-Perldoc noarch 3.28-396.el8 rhel-baseos 88 k perl-Pod-Simple noarch 1:3.35-395.el8 rhel-baseos 213 k perl-Pod-Usage noarch 4:1.69-395.el8 rhel-baseos 34 k perl-Scalar-List-Utils aarch64 3:1.49-2.el8 rhel-baseos 67 k perl-Socket aarch64 4:2.027-3.el8 rhel-baseos 59 k perl-Storable aarch64 1:3.11-3.el8 rhel-baseos 95 k perl-Term-ANSIColor noarch 4.06-396.el8 rhel-baseos 46 k perl-Term-Cap noarch 1.17-395.el8 rhel-baseos 23 k perl-TermReadKey aarch64 2.37-7.el8 rhel-appstream 40 k perl-Text-ParseWords noarch 3.30-395.el8 rhel-baseos 18 k perl-Text-Tabs+Wrap noarch 2013.0523-395.el8 rhel-baseos 24 k perl-Time-Local noarch 1:1.280-1.el8 rhel-baseos 34 k perl-URI noarch 1.73-3.el8 rhel-baseos 116 k perl-Unicode-Normalize aarch64 1.25-396.el8 rhel-baseos 78 k perl-constant noarch 1.33-396.el8 rhel-baseos 25 k perl-interpreter aarch64 4:5.26.3-422.el8 rhel-baseos 6.3 M perl-libnet noarch 3.11-3.el8 rhel-baseos 121 k perl-libs aarch64 4:5.26.3-422.el8 rhel-baseos 1.5 M perl-macros aarch64 4:5.26.3-422.el8 rhel-baseos 73 k perl-parent noarch 1:0.237-1.el8 rhel-baseos 20 k perl-podlators noarch 4.11-1.el8 rhel-baseos 118 k perl-threads aarch64 1:2.21-2.el8 rhel-baseos 60 k perl-threads-shared aarch64 1.58-2.el8 rhel-baseos 47 k pixman aarch64 0.38.4-3.el8_9 rhel-appstream 149 k platform-python-devel aarch64 3.6.8-56.el8_9.3 rhel-appstream 240 k platform-python-pip noarch 9.0.3-23.el8_9.1 rhel-baseos 1.6 M python3-pip noarch 9.0.3-23.el8_9.1 rhel-appstream 20 k python3-rpm-generators noarch 5-8.el8 rhel-appstream 25 k python36 aarch64 3.6.8-38.module+el8.9.0+20976+d3c38525 rhel-appstream 19 k python36-rpm-macros noarch 3.6.8-38.module+el8.9.0+20976+d3c38525 rhel-appstream 16 k shared-mime-info aarch64 1.9-3.el8 rhel-baseos 327 k urw-base35-bookman-fonts noarch 20170801-10.el8 rhel-appstream 857 k urw-base35-c059-fonts noarch 20170801-10.el8 rhel-appstream 884 k 
urw-base35-d050000l-fonts noarch 20170801-10.el8 rhel-appstream 79 k urw-base35-fonts noarch 20170801-10.el8 rhel-appstream 12 k urw-base35-fonts-common noarch 20170801-10.el8 rhel-appstream 23 k urw-base35-gothic-fonts noarch 20170801-10.el8 rhel-appstream 654 k urw-base35-nimbus-mono-ps-fonts noarch 20170801-10.el8 rhel-appstream 801 k urw-base35-nimbus-roman-fonts noarch 20170801-10.el8 rhel-appstream 865 k urw-base35-nimbus-sans-fonts noarch 20170801-10.el8 rhel-appstream 1.3 M urw-base35-p052-fonts noarch 20170801-10.el8 rhel-appstream 982 k urw-base35-standard-symbols-ps-fonts noarch 20170801-10.el8 rhel-appstream 44 k urw-base35-z003-fonts noarch 20170801-10.el8 rhel-appstream 279 k vim-filesystem noarch 2:8.0.1763-19.el8_6.4 rhel-appstream 50 k xorg-x11-font-utils aarch64 1:7.5-41.el8 rhel-appstream 100 k xorg-x11-fonts-ISO8859-1-100dpi noarch 7.5-19.el8 rhel-appstream 1.1 M xorg-x11-server-utils aarch64 7.7-27.el8 rhel-appstream 190 k Enabling module streams: perl 5.26 perl-IO-Socket-SSL 2.066 perl-libwww-perl 6.34 python36 3.6 Transaction Summary ================================================================================================================================================================== Install 171 Packages Total download size: 1.5 G Installed size: 3.1 G Downloading Packages: (1/171): cuda-toolkit-12-4-config-common-12.4.1 498 kB/s | 7.7 kB 00:00 (2/171): cuda-toolkit-12-config-common-12.4.127 2.9 MB/s | 7.9 kB 00:00 (3/171): libcudnn8-devel-8.9.7.29-2.cuda12.3.aa 1.7 MB/s | 35 kB 00:00 (4/171): cuda-toolkit-config-common-12.4.127-1. 
2.4 MB/s | 7.9 kB 00:00 (5/171): cuda-crt-12-4-12.4.131-1.aarch64.rpm 19 MB/s | 112 kB 00:00 (6/171): cuda-cudart-12-4-12.4.127-1.aarch64.rp 32 MB/s | 235 kB 00:00 (7/171): cuda-cccl-12-4-12.4.127-1.aarch64.rpm 99 MB/s | 1.9 MB 00:00 (8/171): cuda-driver-devel-12-4-12.4.127-1.aarc 11 MB/s | 43 kB 00:00 (9/171): cuda-cudart-devel-12-4-12.4.127-1.aarc 140 MB/s | 2.0 MB 00:00 (10/171): cuda-nvml-devel-12-4-12.4.127-1.aarch 49 MB/s | 226 kB 00:00 (11/171): cuda-nvrtc-12-4-12.4.127-1.aarch64.rp 161 MB/s | 23 MB 00:00 (12/171): cuda-nvrtc-devel-12-4-12.4.127-1.aarc 203 MB/s | 26 MB 00:00 (13/171): cuda-nvtx-12-4-12.4.127-1.aarch64.rpm 25 MB/s | 89 kB 00:00 (14/171): cuda-nvcc-12-4-12.4.131-1.aarch64.rpm 167 MB/s | 64 MB 00:00 (15/171): cuda-nvvm-12-4-12.4.131-1.aarch64.rpm 138 MB/s | 25 MB 00:00 (16/171): libcublas-12-4-12.4.5.8-1.aarch64.rpm 206 MB/s | 346 MB 00:01 (17/171): libcurand-12-4-10.3.5.147-1.aarch64.r 206 MB/s | 53 MB 00:00 (18/171): libcublas-devel-12-4-12.4.5.8-1.aarch 174 MB/s | 388 MB 00:02 (19/171): groff-base-1.22.3-18.el8.aarch64.rpm 6.8 MB/s | 994 kB 00:00 (20/171): libcurand-devel-12-4-10.3.5.147-1.aar 92 MB/s | 53 MB 00:00 (21/171): libedit-3.1-23.20170329cvs.el8.aarch6 1.6 MB/s | 99 kB 00:00 (22/171): perl-Data-Dumper-2.167-399.el8.aarch6 1.0 MB/s | 57 kB 00:00 (23/171): libpng-1.6.34-5.el8.aarch64.rpm 1.0 MB/s | 119 kB 00:00 (24/171): perl-Encode-2.97-3.el8.aarch64.rpm 13 MB/s | 1.5 MB 00:00 (25/171): perl-MIME-Base64-3.15-396.el8.aarch64 261 kB/s | 31 kB 00:00 (26/171): perl-PathTools-3.74-1.el8.aarch64.rpm 942 kB/s | 90 kB 00:00 (27/171): perl-Storable-3.11-3.el8.aarch64.rpm 1.5 MB/s | 95 kB 00:00 (28/171): perl-Unicode-Normalize-1.25-396.el8.a 963 kB/s | 78 kB 00:00 (29/171): perl-Scalar-List-Utils-1.49-2.el8.aar 292 kB/s | 67 kB 00:00 (30/171): perl-threads-2.21-2.el8.aarch64.rpm 696 kB/s | 60 kB 00:00 (31/171): perl-threads-shared-1.58-2.el8.aarch6 663 kB/s | 47 kB 00:00 (32/171): shared-mime-info-1.9-3.el8.aarch64.rp 4.4 MB/s | 327 kB 00:00 
(33/171): fontpackages-filesystem-1.44-22.el8.n 159 kB/s | 16 kB 00:00 (34/171): perl-Carp-1.42-396.el8.noarch.rpm 358 kB/s | 30 kB 00:00 (35/171): perl-Exporter-5.72-396.el8.noarch.rpm 649 kB/s | 34 kB 00:00 (36/171): perl-File-Path-2.15-2.el8.noarch.rpm 637 kB/s | 38 kB 00:00 (37/171): perl-File-Temp-0.230.600-1.el8.noarch 901 kB/s | 63 kB 00:00 (38/171): perl-Getopt-Long-2.50-4.el8.noarch.rp 575 kB/s | 63 kB 00:00 (39/171): perl-Pod-Escapes-1.07-395.el8.noarch. 282 kB/s | 20 kB 00:00 (40/171): perl-Pod-Perldoc-3.28-396.el8.noarch. 1.4 MB/s | 88 kB 00:00 (41/171): perl-Pod-Simple-3.35-395.el8.noarch.r 2.7 MB/s | 213 kB 00:00 (42/171): perl-Pod-Usage-1.69-395.el8.noarch.rp 518 kB/s | 34 kB 00:00 (43/171): perl-Term-ANSIColor-4.06-396.el8.noar 725 kB/s | 46 kB 00:00 (44/171): perl-Socket-2.027-3.el8.aarch64.rpm 456 kB/s | 59 kB 00:00 (45/171): perl-Term-Cap-1.17-395.el8.noarch.rpm 306 kB/s | 23 kB 00:00 (46/171): perl-Text-ParseWords-3.30-395.el8.noa 313 kB/s | 18 kB 00:00 (47/171): perl-Time-Local-1.280-1.el8.noarch.rp 614 kB/s | 34 kB 00:00 (48/171): perl-Text-Tabs+Wrap-2013.0523-395.el8 376 kB/s | 24 kB 00:00 (49/171): perl-constant-1.33-396.el8.noarch.rpm 439 kB/s | 25 kB 00:00 (50/171): perl-parent-0.237-1.el8.noarch.rpm 354 kB/s | 20 kB 00:00 (51/171): perl-podlators-4.11-1.el8.noarch.rpm 1.6 MB/s | 118 kB 00:00 (52/171): gdk-pixbuf2-2.36.12-5.el8.aarch64.rpm 5.8 MB/s | 463 kB 00:00 (53/171): libcroco-0.6.12-4.el8_2.1.aarch64.rpm 1.5 MB/s | 108 kB 00:00 (54/171): fontconfig-2.13.1-4.el8.aarch64.rpm 2.6 MB/s | 272 kB 00:00 (55/171): freetype-2.9.1-9.el8.aarch64.rpm 4.5 MB/s | 370 kB 00:00 (56/171): perl-Errno-1.28-422.el8.aarch64.rpm 1.5 MB/s | 76 kB 00:00 (57/171): perl-IO-1.38-422.el8.aarch64.rpm 2.5 MB/s | 142 kB 00:00 (58/171): perl-interpreter-5.26.3-422.el8.aarch 55 MB/s | 6.3 MB 00:00 (59/171): perl-libs-5.26.3-422.el8.aarch64.rpm 21 MB/s | 1.5 MB 00:00 (60/171): perl-macros-5.26.3-422.el8.aarch64.rp 1.2 MB/s | 73 kB 00:00 (61/171): 
python3-setuptools-39.2.0-7.el8.noarc 1.8 MB/s | 163 kB 00:00 (62/171): cups-libs-2.2.6-54.el8_9.aarch64.rpm 5.1 MB/s | 420 kB 00:00 (63/171): dbus-libs-1.12.8-26.el8.aarch64.rpm 1.8 MB/s | 177 kB 00:00 (64/171): emacs-filesystem-26.1-11.el8.noarch.r 1.4 MB/s | 70 kB 00:00 (65/171): perl-URI-1.73-3.el8.noarch.rpm 2.3 MB/s | 116 kB 00:00 (66/171): perl-Digest-1.17-395.el8.noarch.rpm 387 kB/s | 27 kB 00:00 (67/171): perl-libnet-3.11-3.el8.noarch.rpm 2.4 MB/s | 121 kB 00:00 (68/171): avahi-libs-0.7-21.el8_9.1.aarch64.rpm 1.2 MB/s | 60 kB 00:00 (69/171): openssl-1.1.1k-12.el8_9.aarch64.rpm 9.8 MB/s | 692 kB 00:00 (70/171): perl-Digest-MD5-2.55-396.el8.aarch64. 309 kB/s | 37 kB 00:00 (71/171): openssh-8.0p1-19.el8_9.2.aarch64.rpm 7.6 MB/s | 490 kB 00:00 (72/171): perl-IO-Socket-IP-0.39-5.el8.noarch.r 347 kB/s | 47 kB 00:00 (73/171): openssh-clients-8.0p1-19.el8_9.2.aarc 12 MB/s | 627 kB 00:00 (74/171): less-530-2.el8_9.aarch64.rpm 1.8 MB/s | 161 kB 00:00 (75/171): perl-HTTP-Tiny-0.074-2.el8_9.1.noarch 847 kB/s | 59 kB 00:00 (76/171): jbigkit-libs-2.1-14.el8.aarch64.rpm 1.1 MB/s | 54 kB 00:00 (77/171): platform-python-pip-9.0.3-23.el8_9.1. 
14 MB/s | 1.6 MB 00:00 (78/171): libXaw-1.0.13-10.el8.aarch64.rpm 3.2 MB/s | 185 kB 00:00 (79/171): libSM-1.2.3-1.el8.aarch64.rpm 418 kB/s | 46 kB 00:00 (80/171): libXxf86vm-1.1.4-9.el8.aarch64.rpm 159 kB/s | 19 kB 00:00 (81/171): libXcomposite-0.4.4-14.el8.aarch64.rp 183 kB/s | 29 kB 00:00 (82/171): mcpp-2.7.2-20.el8.aarch64.rpm 400 kB/s | 32 kB 00:00 (83/171): libfontenc-1.1.3-8.el8.aarch64.rpm 293 kB/s | 36 kB 00:00 (84/171): perl-TermReadKey-2.37-7.el8.aarch64.r 700 kB/s | 40 kB 00:00 (85/171): xorg-x11-server-utils-7.7-27.el8.aarc 2.8 MB/s | 190 kB 00:00 (86/171): graphite2-1.3.10-10.el8.aarch64.rpm 2.0 MB/s | 113 kB 00:00 (87/171): atk-2.28.1-1.el8.aarch64.rpm 2.4 MB/s | 270 kB 00:00 (88/171): lcms2-2.9-2.el8.aarch64.rpm 3.0 MB/s | 156 kB 00:00 (89/171): harfbuzz-1.7.5-3.el8.aarch64.rpm 3.7 MB/s | 279 kB 00:00 (90/171): libXdamage-1.1.4-14.el8.aarch64.rpm 558 kB/s | 27 kB 00:00 (91/171): libXfixes-5.0.3-7.el8.aarch64.rpm 280 kB/s | 25 kB 00:00 (92/171): libXinerama-1.1.4-1.el8.aarch64.rpm 285 kB/s | 15 kB 00:00 (93/171): libXrender-0.9.10-7.el8.aarch64.rpm 567 kB/s | 31 kB 00:00 (94/171): libXcursor-1.1.15-3.el8.aarch64.rpm 127 kB/s | 35 kB 00:00 (95/171): libXxf86misc-1.0.4-1.el8.aarch64.rpm 477 kB/s | 23 kB 00:00 (96/171): libidn-1.34-5.el8.aarch64.rpm 4.6 MB/s | 236 kB 00:00 (97/171): libdatrie-0.2.9-7.el8.aarch64.rpm 288 kB/s | 33 kB 00:00 (98/171): libmcpp-2.7.2-20.el8.aarch64.rpm 1.5 MB/s | 76 kB 00:00 (99/171): libijs-0.35-5.el8.aarch64.rpm 190 kB/s | 29 kB 00:00 (100/171): libthai-0.1.27-2.el8.aarch64.rpm 3.2 MB/s | 201 kB 00:00 (101/171): libpaper-1.1.24-22.el8.aarch64.rpm 279 kB/s | 44 kB 00:00 (102/171): google-droid-sans-fonts-20120715-13. 40 MB/s | 2.5 MB 00:00 (103/171): urw-base35-d050000l-fonts-20170801-1 1.6 MB/s | 79 kB 00:00 (104/171): hicolor-icon-theme-0.17-2.el8.noarch 445 kB/s | 48 kB 00:00 (105/171): urw-base35-fonts-20170801-10.el8.noa 247 kB/s | 12 kB 00:00 (106/171): urw-base35-gothic-fonts-20170801-10. 
11 MB/s | 654 kB 00:00 (107/171): urw-base35-nimbus-sans-fonts-2017080 8.9 MB/s | 1.3 MB 00:00 (108/171): urw-base35-p052-fonts-20170801-10.el 7.9 MB/s | 982 kB 00:00 (109/171): xorg-x11-fonts-ISO8859-1-100dpi-7.5- 16 MB/s | 1.1 MB 00:00 (110/171): adobe-mappings-cmap-20171205-3.el8.n 16 MB/s | 2.1 MB 00:00 (111/171): adobe-mappings-pdf-20180407-1.el8.no 6.3 MB/s | 707 kB 00:00 (112/171): adobe-mappings-cmap-deprecated-20171 516 kB/s | 119 kB 00:00 (113/171): perl-Error-0.17025-2.el8.noarch.rpm 869 kB/s | 46 kB 00:00 (114/171): urw-base35-c059-fonts-20170801-10.el 16 MB/s | 884 kB 00:00 (115/171): urw-base35-bookman-fonts-20170801-10 10 MB/s | 857 kB 00:00 (116/171): urw-base35-fonts-common-20170801-10. 470 kB/s | 23 kB 00:00 (117/171): urw-base35-nimbus-mono-ps-fonts-2017 14 MB/s | 801 kB 00:00 (118/171): urw-base35-nimbus-roman-fonts-201708 16 MB/s | 865 kB 00:00 (119/171): urw-base35-standard-symbols-ps-fonts 757 kB/s | 44 kB 00:00 (120/171): urw-base35-z003-fonts-20170801-10.el 5.6 MB/s | 279 kB 00:00 (121/171): gdk-pixbuf2-modules-2.36.12-5.el8.aa 2.1 MB/s | 106 kB 00:00 (122/171): libICE-1.0.9-15.el8.aarch64.rpm 1.0 MB/s | 71 kB 00:00 (123/171): libXt-1.1.5-12.el8.aarch64.rpm 2.9 MB/s | 174 kB 00:00 (124/171): libxcb-1.13.1-1.el8.aarch64.rpm 4.4 MB/s | 223 kB 00:00 (125/171): perl-IO-Socket-SSL-2.066-4.module+el 5.6 MB/s | 298 kB 00:00 (126/171): perl-Mozilla-CA-20160104-7.module+el 324 kB/s | 15 kB 00:00 (127/171): libXext-1.3.4-1.el8.aarch64.rpm 752 kB/s | 44 kB 00:00 (128/171): libXau-1.0.9-3.el8.aarch64.rpm 276 kB/s | 37 kB 00:00 (129/171): libXi-1.7.10-1.el8.aarch64.rpm 665 kB/s | 46 kB 00:00 (130/171): libXmu-1.1.3-1.el8.aarch64.rpm 1.2 MB/s | 73 kB 00:00 (131/171): gd-2.2.5-7.el8.aarch64.rpm 2.5 MB/s | 134 kB 00:00 (132/171): libXft-2.3.3-1.el8.aarch64.rpm 1.2 MB/s | 65 kB 00:00 (133/171): libXrandr-1.5.2-1.el8.aarch64.rpm 574 kB/s | 33 kB 00:00 (134/171): jbig2dec-libs-0.16-1.el8.aarch64.rpm 1.3 MB/s | 68 kB 00:00 (135/171): 
gtk2-2.24.32-5.el8.aarch64.rpm 37 MB/s | 3.3 MB 00:00 (136/171): pango-1.42.4-8.el8.aarch64.rpm 5.6 MB/s | 285 kB 00:00 (137/171): libuv-1.41.1-1.el8_4.aarch64.rpm 2.1 MB/s | 150 kB 00:00 (138/171): xorg-x11-font-utils-7.5-41.el8.aarch 1.6 MB/s | 100 kB 00:00 (139/171): jasper-libs-2.0.14-5.el8.aarch64.rpm 2.3 MB/s | 158 kB 00:00 (140/171): libjpeg-turbo-1.5.3-12.el8.aarch64.r 2.9 MB/s | 146 kB 00:00 (141/171): perl-Net-SSLeay-1.88-2.module+el8.6. 7.1 MB/s | 373 kB 00:00 (142/171): cairo-1.15.12-6.el8.aarch64.rpm 12 MB/s | 672 kB 00:00 (143/171): fribidi-1.0.4-9.el8.aarch64.rpm 969 kB/s | 89 kB 00:00 (144/171): vim-filesystem-8.0.1763-19.el8_6.4.n 312 kB/s | 50 kB 00:00 (145/171): gtk-update-icon-cache-3.22.30-11.el8 602 kB/s | 32 kB 00:00 (146/171): openjpeg2-2.4.0-5.el8.aarch64.rpm 2.5 MB/s | 155 kB 00:00 (147/171): libXpm-3.5.12-9.el8_7.aarch64.rpm 1.1 MB/s | 56 kB 00:00 (148/171): git-2.39.3-1.el8_8.aarch64.rpm 2.1 MB/s | 104 kB 00:00 (149/171): graphviz-2.40.1-44.el8.aarch64.rpm 24 MB/s | 1.8 MB 00:00 (150/171): git-core-doc-2.39.3-1.el8_8.noarch.r 42 MB/s | 3.0 MB 00:00 (151/171): git-core-2.39.3-1.el8_8.aarch64.rpm 91 MB/s | 10 MB 00:00 (152/171): perl-Git-2.39.3-1.el8_8.noarch.rpm 1.6 MB/s | 79 kB 00:00 (153/171): python3-rpm-generators-5-8.el8.noarc 506 kB/s | 25 kB 00:00 (154/171): libtiff-4.0.9-29.el8_8.aarch64.rpm 2.7 MB/s | 180 kB 00:00 (155/171): libX11-1.6.8-6.el8.aarch64.rpm 8.1 MB/s | 589 kB 00:00 (156/171): libgs-9.27-11.el8.aarch64.rpm 45 MB/s | 2.9 MB 00:00 (157/171): librsvg2-2.42.7-5.el8.aarch64.rpm 9.7 MB/s | 523 kB 00:00 (158/171): libwebp-1.0.0-9.el8_9.1.aarch64.rpm 4.5 MB/s | 247 kB 00:00 (159/171): cmake-data-3.26.5-1.el8_9.noarch.rpm 28 MB/s | 1.9 MB 00:00 (160/171): cmake-3.26.5-1.el8_9.aarch64.rpm 99 MB/s | 12 MB 00:00 (161/171): cmake-filesystem-3.26.5-1.el8_9.aarc 960 kB/s | 45 kB 00:00 (162/171): cmake-rpm-macros-3.26.5-1.el8_9.noar 926 kB/s | 44 kB 00:00 (163/171): libX11-common-1.6.8-6.el8.noarch.rpm 2.2 MB/s | 158 kB 00:00 
(164/171): pixman-0.38.4-3.el8_9.aarch64.rpm 3.0 MB/s | 149 kB 00:00 (165/171): platform-python-devel-3.6.8-56.el8_9 4.5 MB/s | 240 kB 00:00 (166/171): python36-3.6.8-38.module+el8.9.0+209 347 kB/s | 19 kB 00:00 (167/171): python36-devel-3.6.8-38.module+el8.9 344 kB/s | 17 kB 00:00 (168/171): python36-rpm-macros-3.6.8-38.module+ 319 kB/s | 16 kB 00:00 (169/171): python3-pip-9.0.3-23.el8_9.1.noarch. 371 kB/s | 20 kB 00:00 (170/171): doxygen-1.8.14-12.el8.aarch64.rpm 38 MB/s | 3.6 MB 00:00 (171/171): libcudnn8-8.9.7.29-2.cuda12.3.aarch6 6.7 MB/s | 466 MB 01:09 -------------------------------------------------------------------------------- Total 22 MB/s | 1.5 GB 01:09 Running transaction check Transaction check succeeded. Running transaction test Transaction test succeeded. Running transaction Preparing : 1/1 Installing : libpng-2:1.6.34-5.el8.aarch64 1/171 Installing : freetype-2.9.1-9.el8.aarch64 2/171 Installing : libjpeg-turbo-1.5.3-12.el8.aarch64 3/171 Installing : libICE-1.0.9-15.el8.aarch64 4/171 Installing : emacs-filesystem-1:26.1-11.el8.noarch 5/171 Installing : fontpackages-filesystem-1.44-22.el8.noarch 6/171 Installing : urw-base35-fonts-common-20170801-10.el8.noarch 7/171 Installing : cuda-toolkit-config-common-12.4.127-1.noarch 8/171 Installing : cuda-toolkit-12-config-common-12.4.127-1.noarch 9/171 Installing : cuda-toolkit-12-4-config-common-12.4.127-1.noarc 10/171 Installing : google-droid-sans-fonts-20120715-13.el8.noarch 11/171 Installing : fontconfig-2.13.1-4.el8.aarch64 12/171 Running scriptlet: fontconfig-2.13.1-4.el8.aarch64 12/171 Installing : libSM-1.2.3-1.el8.aarch64 13/171 Installing : cmake-rpm-macros-3.26.5-1.el8_9.noarch 14/171 Installing : cmake-filesystem-3.26.5-1.el8_9.aarch64 15/171 Installing : adobe-mappings-cmap-20171205-3.el8.noarch 16/171 Installing : atk-2.28.1-1.el8.aarch64 17/171 Installing : adobe-mappings-cmap-deprecated-20171205-3.el8.no 18/171 Installing : cuda-cudart-12-4-12.4.127-1.aarch64 19/171 Running scriptlet: 
cuda-cudart-12-4-12.4.127-1.aarch64 19/171 Installing : libcublas-12-4-12.4.5.8-1.aarch64 20/171 Running scriptlet: libcublas-12-4-12.4.5.8-1.aarch64 20/171 Installing : libcurand-12-4-10.3.5.147-1.aarch64 21/171 Running scriptlet: libcurand-12-4-10.3.5.147-1.aarch64 21/171 Installing : libidn-1.34-5.el8.aarch64 22/171 Running scriptlet: libidn-1.34-5.el8.aarch64 22/171 Installing : jasper-libs-2.0.14-5.el8.aarch64 23/171 Installing : pixman-0.38.4-3.el8_9.aarch64 24/171 Installing : libX11-common-1.6.8-6.el8.noarch 25/171 Installing : libwebp-1.0.0-9.el8_9.1.aarch64 26/171 Installing : python3-rpm-generators-5-8.el8.noarch 27/171 Installing : platform-python-devel-3.6.8-56.el8_9.3.aarch64 28/171 Installing : openjpeg2-2.4.0-5.el8.aarch64 29/171 Installing : fribidi-1.0.4-9.el8.aarch64 30/171 Installing : vim-filesystem-2:8.0.1763-19.el8_6.4.noarch 31/171 Installing : libuv-1:1.41.1-1.el8_4.aarch64 32/171 Installing : cmake-3.26.5-1.el8_9.aarch64 33/171 Installing : cmake-data-3.26.5-1.el8_9.noarch 34/171 Installing : jbig2dec-libs-0.16-1.el8.aarch64 35/171 Running scriptlet: jbig2dec-libs-0.16-1.el8.aarch64 35/171 Installing : libXau-1.0.9-3.el8.aarch64 36/171 Installing : libxcb-1.13.1-1.el8.aarch64 37/171 Installing : libX11-1.6.8-6.el8.aarch64 38/171 Installing : libXext-1.3.4-1.el8.aarch64 39/171 Installing : libXrender-0.9.10-7.el8.aarch64 40/171 Installing : cairo-1.15.12-6.el8.aarch64 41/171 Installing : libXt-1.1.5-12.el8.aarch64 42/171 Installing : libXmu-1.1.3-1.el8.aarch64 43/171 Installing : libXfixes-5.0.3-7.el8.aarch64 44/171 Installing : libXpm-3.5.12-9.el8_7.aarch64 45/171 Installing : libXcursor-1.1.15-3.el8.aarch64 46/171 Installing : libXrandr-1.5.2-1.el8.aarch64 47/171 Installing : libXinerama-1.1.4-1.el8.aarch64 48/171 Installing : libXi-1.7.10-1.el8.aarch64 49/171 Installing : libXaw-1.0.13-10.el8.aarch64 50/171 Installing : libXdamage-1.1.4-14.el8.aarch64 51/171 Installing : libXft-2.3.3-1.el8.aarch64 52/171 Installing : 
libXxf86vm-1.1.4-9.el8.aarch64 53/171 Installing : libXxf86misc-1.0.4-1.el8.aarch64 54/171 Installing : libXcomposite-0.4.4-14.el8.aarch64 55/171 Installing : adobe-mappings-pdf-20180407-1.el8.noarch 56/171 Installing : hicolor-icon-theme-0.17-2.el8.noarch 57/171 Installing : libpaper-1.1.24-22.el8.aarch64 58/171 Installing : libmcpp-2.7.2-20.el8.aarch64 59/171 Running scriptlet: libmcpp-2.7.2-20.el8.aarch64 59/171 Installing : mcpp-2.7.2-20.el8.aarch64 60/171 Installing : xorg-x11-server-utils-7.7-27.el8.aarch64 61/171 Installing : libijs-0.35-5.el8.aarch64 62/171 Installing : libdatrie-0.2.9-7.el8.aarch64 63/171 Running scriptlet: libdatrie-0.2.9-7.el8.aarch64 63/171 Installing : libthai-0.1.27-2.el8.aarch64 64/171 Running scriptlet: libthai-0.1.27-2.el8.aarch64 64/171 Installing : lcms2-2.9-2.el8.aarch64 65/171 Running scriptlet: lcms2-2.9-2.el8.aarch64 65/171 Installing : graphite2-1.3.10-10.el8.aarch64 66/171 Installing : harfbuzz-1.7.5-3.el8.aarch64 67/171 Running scriptlet: harfbuzz-1.7.5-3.el8.aarch64 67/171 Installing : pango-1.42.4-8.el8.aarch64 68/171 Running scriptlet: pango-1.42.4-8.el8.aarch64 68/171 Installing : libfontenc-1.1.3-8.el8.aarch64 69/171 Installing : xorg-x11-font-utils-1:7.5-41.el8.aarch64 70/171 Installing : urw-base35-d050000l-fonts-20170801-10.el8.noarch 71/171 Running scriptlet: urw-base35-d050000l-fonts-20170801-10.el8.noarch 71/171 Installing : urw-base35-gothic-fonts-20170801-10.el8.noarch 72/171 Running scriptlet: urw-base35-gothic-fonts-20170801-10.el8.noarch 72/171 Installing : urw-base35-nimbus-sans-fonts-20170801-10.el8.noa 73/171 Running scriptlet: urw-base35-nimbus-sans-fonts-20170801-10.el8.noa 73/171 Installing : urw-base35-p052-fonts-20170801-10.el8.noarch 74/171 Running scriptlet: urw-base35-p052-fonts-20170801-10.el8.noarch 74/171 Installing : xorg-x11-fonts-ISO8859-1-100dpi-7.5-19.el8.noarc 75/171 Running scriptlet: xorg-x11-fonts-ISO8859-1-100dpi-7.5-19.el8.noarc 75/171 Installing : 
urw-base35-bookman-fonts-20170801-10.el8.noarch 76/171 Running scriptlet: urw-base35-bookman-fonts-20170801-10.el8.noarch 76/171 Installing : urw-base35-c059-fonts-20170801-10.el8.noarch 77/171 Running scriptlet: urw-base35-c059-fonts-20170801-10.el8.noarch 77/171 Installing : urw-base35-nimbus-mono-ps-fonts-20170801-10.el8. 78/171 Running scriptlet: urw-base35-nimbus-mono-ps-fonts-20170801-10.el8. 78/171 Installing : urw-base35-nimbus-roman-fonts-20170801-10.el8.no 79/171 Running scriptlet: urw-base35-nimbus-roman-fonts-20170801-10.el8.no 79/171 Installing : urw-base35-standard-symbols-ps-fonts-20170801-10 80/171 Running scriptlet: urw-base35-standard-symbols-ps-fonts-20170801-10 80/171 Installing : urw-base35-z003-fonts-20170801-10.el8.noarch 81/171 Running scriptlet: urw-base35-z003-fonts-20170801-10.el8.noarch 81/171 Installing : urw-base35-fonts-20170801-10.el8.noarch 82/171 Installing : jbigkit-libs-2.1-14.el8.aarch64 83/171 Running scriptlet: jbigkit-libs-2.1-14.el8.aarch64 83/171 Installing : libtiff-4.0.9-29.el8_8.aarch64 84/171 Installing : gd-2.2.5-7.el8.aarch64 85/171 Running scriptlet: gd-2.2.5-7.el8.aarch64 85/171 Installing : platform-python-pip-9.0.3-23.el8_9.1.noarch 86/171 Installing : less-530-2.el8_9.aarch64 87/171 Running scriptlet: openssh-8.0p1-19.el8_9.2.aarch64 88/171 Installing : openssh-8.0p1-19.el8_9.2.aarch64 88/171 Installing : openssl-1:1.1.1k-12.el8_9.aarch64 89/171 Installing : dbus-libs-1:1.12.8-26.el8.aarch64 90/171 Running scriptlet: dbus-libs-1:1.12.8-26.el8.aarch64 90/171 Installing : avahi-libs-0.7-21.el8_9.1.aarch64 91/171 Installing : cups-libs-1:2.2.6-54.el8_9.aarch64 92/171 Installing : libgs-9.27-11.el8.aarch64 93/171 Installing : python3-setuptools-39.2.0-7.el8.noarch 94/171 Installing : python3-pip-9.0.3-23.el8_9.1.noarch 95/171 Installing : python36-3.6.8-38.module+el8.9.0+20976+d3c38525. 96/171 Running scriptlet: python36-3.6.8-38.module+el8.9.0+20976+d3c38525. 
96/171 Installing : libcroco-0.6.12-4.el8_2.1.aarch64 97/171 Running scriptlet: libcroco-0.6.12-4.el8_2.1.aarch64 97/171 Installing : shared-mime-info-1.9-3.el8.aarch64 98/171 Running scriptlet: shared-mime-info-1.9-3.el8.aarch64 98/171 Installing : gdk-pixbuf2-2.36.12-5.el8.aarch64 99/171 Running scriptlet: gdk-pixbuf2-2.36.12-5.el8.aarch64 99/171 Installing : gdk-pixbuf2-modules-2.36.12-5.el8.aarch64 100/171 Installing : gtk-update-icon-cache-3.22.30-11.el8.aarch64 101/171 Installing : gtk2-2.24.32-5.el8.aarch64 102/171 Running scriptlet: gtk2-2.24.32-5.el8.aarch64 102/171 Installing : librsvg2-2.42.7-5.el8.aarch64 103/171 Installing : libedit-3.1-23.20170329cvs.el8.aarch64 104/171 Installing : openssh-clients-8.0p1-19.el8_9.2.aarch64 105/171 Installing : git-core-2.39.3-1.el8_8.aarch64 106/171 Installing : git-core-doc-2.39.3-1.el8_8.noarch 107/171 Installing : groff-base-1.22.3-18.el8.aarch64 108/171 Installing : perl-Digest-1.17-395.el8.noarch 109/171 Installing : perl-Digest-MD5-2.55-396.el8.aarch64 110/171 Installing : perl-Data-Dumper-2.167-399.el8.aarch64 111/171 Installing : perl-libnet-3.11-3.el8.noarch 112/171 Installing : perl-URI-1.73-3.el8.noarch 113/171 Installing : perl-Pod-Escapes-1:1.07-395.el8.noarch 114/171 Installing : perl-Time-Local-1:1.280-1.el8.noarch 115/171 Installing : perl-IO-Socket-IP-0.39-5.el8.noarch 116/171 Installing : perl-Mozilla-CA-20160104-7.module+el8.3.0+6498+9 117/171 Installing : perl-Net-SSLeay-1.88-2.module+el8.6.0+13392+f089 118/171 Installing : perl-IO-Socket-SSL-2.066-4.module+el8.3.0+6446+5 119/171 Installing : perl-Term-ANSIColor-4.06-396.el8.noarch 120/171 Installing : perl-Term-Cap-1.17-395.el8.noarch 121/171 Installing : perl-File-Temp-0.230.600-1.el8.noarch 122/171 Installing : perl-HTTP-Tiny-0.074-2.el8_9.1.noarch 123/171 Installing : perl-Pod-Simple-1:3.35-395.el8.noarch 124/171 Installing : perl-podlators-4.11-1.el8.noarch 125/171 Installing : perl-Pod-Perldoc-3.28-396.el8.noarch 126/171 Installing : 
perl-Text-ParseWords-3.30-395.el8.noarch 127/171 Installing : perl-Pod-Usage-4:1.69-395.el8.noarch 128/171 Installing : perl-MIME-Base64-3.15-396.el8.aarch64 129/171 Installing : perl-Storable-1:3.11-3.el8.aarch64 130/171 Installing : perl-Getopt-Long-1:2.50-4.el8.noarch 131/171 Installing : perl-Socket-4:2.027-3.el8.aarch64 132/171 Installing : perl-Errno-1.28-422.el8.aarch64 133/171 Installing : perl-Encode-4:2.97-3.el8.aarch64 134/171 Installing : perl-Scalar-List-Utils-3:1.49-2.el8.aarch64 135/171 Installing : perl-Carp-1.42-396.el8.noarch 136/171 Installing : perl-Exporter-5.72-396.el8.noarch 137/171 Installing : perl-libs-4:5.26.3-422.el8.aarch64 138/171 Installing : perl-parent-1:0.237-1.el8.noarch 139/171 Installing : perl-macros-4:5.26.3-422.el8.aarch64 140/171 Installing : perl-Unicode-Normalize-1.25-396.el8.aarch64 141/171 Installing : perl-threads-shared-1.58-2.el8.aarch64 142/171 Installing : perl-threads-1:2.21-2.el8.aarch64 143/171 Installing : perl-Text-Tabs+Wrap-2013.0523-395.el8.noarch 144/171 Installing : perl-constant-1.33-396.el8.noarch 145/171 Installing : perl-PathTools-3.74-1.el8.aarch64 146/171 Installing : perl-File-Path-2.15-2.el8.noarch 147/171 Installing : perl-IO-1.38-422.el8.aarch64 148/171 Installing : perl-interpreter-4:5.26.3-422.el8.aarch64 149/171 Installing : perl-TermReadKey-2.37-7.el8.aarch64 150/171 Installing : perl-Error-1:0.17025-2.el8.noarch 151/171 Installing : perl-Git-2.39.3-1.el8_8.noarch 152/171 Installing : git-2.39.3-1.el8_8.aarch64 153/171 Installing : cuda-nvvm-12-4-12.4.131-1.aarch64 154/171 Installing : cuda-nvrtc-12-4-12.4.127-1.aarch64 155/171 Running scriptlet: cuda-nvrtc-12-4-12.4.127-1.aarch64 155/171 Installing : cuda-crt-12-4-12.4.131-1.aarch64 156/171 Installing : cuda-cccl-12-4-12.4.127-1.aarch64 157/171 Installing : libcudnn8-8.9.7.29-2.cuda12.3.aarch64 158/171 Installing : libcudnn8-devel-8.9.7.29-2.cuda12.3.aarch64 159/171 Running scriptlet: libcudnn8-devel-8.9.7.29-2.cuda12.3.aarch64 159/171 
Installing : cuda-cudart-devel-12-4-12.4.127-1.aarch64 160/171 Installing : cuda-nvcc-12-4-12.4.131-1.aarch64 161/171 Installing : cuda-nvrtc-devel-12-4-12.4.127-1.aarch64 162/171 Installing : doxygen-1:1.8.14-12.el8.aarch64 163/171 Installing : graphviz-2.40.1-44.el8.aarch64 164/171 Running scriptlet: graphviz-2.40.1-44.el8.aarch64 164/171 Installing : python36-devel-3.6.8-38.module+el8.9.0+20976+d3c 165/171 Running scriptlet: python36-devel-3.6.8-38.module+el8.9.0+20976+d3c 165/171 Installing : libcurand-devel-12-4-10.3.5.147-1.aarch64 166/171 Installing : libcublas-devel-12-4-12.4.5.8-1.aarch64 167/171 Installing : python36-rpm-macros-3.6.8-38.module+el8.9.0+2097 168/171 Installing : cuda-nvtx-12-4-12.4.127-1.aarch64 169/171 Installing : cuda-nvml-devel-12-4-12.4.127-1.aarch64 170/171 Installing : cuda-driver-devel-12-4-12.4.127-1.aarch64 171/171 Running scriptlet: cuda-toolkit-12-4-config-common-12.4.127-1.noarc 171/171 Running scriptlet: urw-base35-d050000l-fonts-20170801-10.el8.noarch 171/171 Running scriptlet: urw-base35-gothic-fonts-20170801-10.el8.noarch 171/171 Running scriptlet: urw-base35-nimbus-sans-fonts-20170801-10.el8.noa 171/171 Running scriptlet: urw-base35-p052-fonts-20170801-10.el8.noarch 171/171 Running scriptlet: urw-base35-bookman-fonts-20170801-10.el8.noarch 171/171 Running scriptlet: urw-base35-c059-fonts-20170801-10.el8.noarch 171/171 Running scriptlet: urw-base35-nimbus-mono-ps-fonts-20170801-10.el8. 
171/171 Running scriptlet: urw-base35-nimbus-roman-fonts-20170801-10.el8.no 171/171 Running scriptlet: urw-base35-standard-symbols-ps-fonts-20170801-10 171/171 Running scriptlet: urw-base35-z003-fonts-20170801-10.el8.noarch 171/171 Running scriptlet: cuda-driver-devel-12-4-12.4.127-1.aarch64 171/171 Running scriptlet: fontconfig-2.13.1-4.el8.aarch64 171/171 Running scriptlet: hicolor-icon-theme-0.17-2.el8.noarch 171/171 Running scriptlet: shared-mime-info-1.9-3.el8.aarch64 171/171 Running scriptlet: gdk-pixbuf2-2.36.12-5.el8.aarch64 171/171 Verifying : libcudnn8-8.9.7.29-2.cuda12.3.aarch64 1/171 Verifying : libcudnn8-devel-8.9.7.29-2.cuda12.3.aarch64 2/171 Verifying : cuda-toolkit-12-4-config-common-12.4.127-1.noarc 3/171 Verifying : cuda-toolkit-12-config-common-12.4.127-1.noarch 4/171 Verifying : cuda-toolkit-config-common-12.4.127-1.noarch 5/171 Verifying : cuda-cccl-12-4-12.4.127-1.aarch64 6/171 Verifying : cuda-crt-12-4-12.4.131-1.aarch64 7/171 Verifying : cuda-cudart-12-4-12.4.127-1.aarch64 8/171 Verifying : cuda-cudart-devel-12-4-12.4.127-1.aarch64 9/171 Verifying : cuda-driver-devel-12-4-12.4.127-1.aarch64 10/171 Verifying : cuda-nvcc-12-4-12.4.131-1.aarch64 11/171 Verifying : cuda-nvml-devel-12-4-12.4.127-1.aarch64 12/171 Verifying : cuda-nvrtc-12-4-12.4.127-1.aarch64 13/171 Verifying : cuda-nvrtc-devel-12-4-12.4.127-1.aarch64 14/171 Verifying : cuda-nvtx-12-4-12.4.127-1.aarch64 15/171 Verifying : cuda-nvvm-12-4-12.4.131-1.aarch64 16/171 Verifying : libcublas-12-4-12.4.5.8-1.aarch64 17/171 Verifying : libcublas-devel-12-4-12.4.5.8-1.aarch64 18/171 Verifying : libcurand-12-4-10.3.5.147-1.aarch64 19/171 Verifying : libcurand-devel-12-4-10.3.5.147-1.aarch64 20/171 Verifying : groff-base-1.22.3-18.el8.aarch64 21/171 Verifying : libedit-3.1-23.20170329cvs.el8.aarch64 22/171 Verifying : libpng-2:1.6.34-5.el8.aarch64 23/171 Verifying : perl-Data-Dumper-2.167-399.el8.aarch64 24/171 Verifying : perl-Encode-4:2.97-3.el8.aarch64 25/171 Verifying : 
perl-MIME-Base64-3.15-396.el8.aarch64 26/171 Verifying : perl-PathTools-3.74-1.el8.aarch64 27/171 Verifying : perl-Scalar-List-Utils-3:1.49-2.el8.aarch64 28/171 Verifying : perl-Storable-1:3.11-3.el8.aarch64 29/171 Verifying : perl-Unicode-Normalize-1.25-396.el8.aarch64 30/171 Verifying : perl-threads-1:2.21-2.el8.aarch64 31/171 Verifying : perl-threads-shared-1.58-2.el8.aarch64 32/171 Verifying : shared-mime-info-1.9-3.el8.aarch64 33/171 Verifying : fontpackages-filesystem-1.44-22.el8.noarch 34/171 Verifying : perl-Carp-1.42-396.el8.noarch 35/171 Verifying : perl-Exporter-5.72-396.el8.noarch 36/171 Verifying : perl-File-Path-2.15-2.el8.noarch 37/171 Verifying : perl-File-Temp-0.230.600-1.el8.noarch 38/171 Verifying : perl-Getopt-Long-1:2.50-4.el8.noarch 39/171 Verifying : perl-Pod-Escapes-1:1.07-395.el8.noarch 40/171 Verifying : perl-Pod-Perldoc-3.28-396.el8.noarch 41/171 Verifying : perl-Pod-Simple-1:3.35-395.el8.noarch 42/171 Verifying : perl-Pod-Usage-4:1.69-395.el8.noarch 43/171 Verifying : perl-Socket-4:2.027-3.el8.aarch64 44/171 Verifying : perl-Term-ANSIColor-4.06-396.el8.noarch 45/171 Verifying : perl-Term-Cap-1.17-395.el8.noarch 46/171 Verifying : perl-Text-ParseWords-3.30-395.el8.noarch 47/171 Verifying : perl-Text-Tabs+Wrap-2013.0523-395.el8.noarch 48/171 Verifying : perl-Time-Local-1:1.280-1.el8.noarch 49/171 Verifying : perl-constant-1.33-396.el8.noarch 50/171 Verifying : perl-parent-1:0.237-1.el8.noarch 51/171 Verifying : perl-podlators-4.11-1.el8.noarch 52/171 Verifying : gdk-pixbuf2-2.36.12-5.el8.aarch64 53/171 Verifying : libcroco-0.6.12-4.el8_2.1.aarch64 54/171 Verifying : fontconfig-2.13.1-4.el8.aarch64 55/171 Verifying : freetype-2.9.1-9.el8.aarch64 56/171 Verifying : perl-Errno-1.28-422.el8.aarch64 57/171 Verifying : perl-IO-1.38-422.el8.aarch64 58/171 Verifying : perl-interpreter-4:5.26.3-422.el8.aarch64 59/171 Verifying : perl-libs-4:5.26.3-422.el8.aarch64 60/171 Verifying : perl-macros-4:5.26.3-422.el8.aarch64 61/171 Verifying : 
python3-setuptools-39.2.0-7.el8.noarch 62/171 Verifying : cups-libs-1:2.2.6-54.el8_9.aarch64 63/171 Verifying : dbus-libs-1:1.12.8-26.el8.aarch64 64/171 Verifying : emacs-filesystem-1:26.1-11.el8.noarch 65/171 Verifying : perl-Digest-1.17-395.el8.noarch 66/171 Verifying : perl-URI-1.73-3.el8.noarch 67/171 Verifying : perl-libnet-3.11-3.el8.noarch 68/171 Verifying : avahi-libs-0.7-21.el8_9.1.aarch64 69/171 Verifying : openssl-1:1.1.1k-12.el8_9.aarch64 70/171 Verifying : perl-Digest-MD5-2.55-396.el8.aarch64 71/171 Verifying : perl-IO-Socket-IP-0.39-5.el8.noarch 72/171 Verifying : openssh-8.0p1-19.el8_9.2.aarch64 73/171 Verifying : openssh-clients-8.0p1-19.el8_9.2.aarch64 74/171 Verifying : less-530-2.el8_9.aarch64 75/171 Verifying : perl-HTTP-Tiny-0.074-2.el8_9.1.noarch 76/171 Verifying : platform-python-pip-9.0.3-23.el8_9.1.noarch 77/171 Verifying : jbigkit-libs-2.1-14.el8.aarch64 78/171 Verifying : libSM-1.2.3-1.el8.aarch64 79/171 Verifying : libXaw-1.0.13-10.el8.aarch64 80/171 Verifying : libXcomposite-0.4.4-14.el8.aarch64 81/171 Verifying : libXxf86vm-1.1.4-9.el8.aarch64 82/171 Verifying : libfontenc-1.1.3-8.el8.aarch64 83/171 Verifying : mcpp-2.7.2-20.el8.aarch64 84/171 Verifying : perl-TermReadKey-2.37-7.el8.aarch64 85/171 Verifying : xorg-x11-server-utils-7.7-27.el8.aarch64 86/171 Verifying : atk-2.28.1-1.el8.aarch64 87/171 Verifying : graphite2-1.3.10-10.el8.aarch64 88/171 Verifying : harfbuzz-1.7.5-3.el8.aarch64 89/171 Verifying : lcms2-2.9-2.el8.aarch64 90/171 Verifying : libXcursor-1.1.15-3.el8.aarch64 91/171 Verifying : libXdamage-1.1.4-14.el8.aarch64 92/171 Verifying : libXfixes-5.0.3-7.el8.aarch64 93/171 Verifying : libXinerama-1.1.4-1.el8.aarch64 94/171 Verifying : libXrender-0.9.10-7.el8.aarch64 95/171 Verifying : libXxf86misc-1.0.4-1.el8.aarch64 96/171 Verifying : libdatrie-0.2.9-7.el8.aarch64 97/171 Verifying : libidn-1.34-5.el8.aarch64 98/171 Verifying : libijs-0.35-5.el8.aarch64 99/171 Verifying : libmcpp-2.7.2-20.el8.aarch64 100/171 Verifying : 
libpaper-1.1.24-22.el8.aarch64 101/171 Verifying : libthai-0.1.27-2.el8.aarch64 102/171 Verifying : google-droid-sans-fonts-20120715-13.el8.noarch 103/171 Verifying : hicolor-icon-theme-0.17-2.el8.noarch 104/171 Verifying : urw-base35-d050000l-fonts-20170801-10.el8.noarch 105/171 Verifying : urw-base35-fonts-20170801-10.el8.noarch 106/171 Verifying : urw-base35-gothic-fonts-20170801-10.el8.noarch 107/171 Verifying : urw-base35-nimbus-sans-fonts-20170801-10.el8.noa 108/171 Verifying : urw-base35-p052-fonts-20170801-10.el8.noarch 109/171 Verifying : xorg-x11-fonts-ISO8859-1-100dpi-7.5-19.el8.noarc 110/171 Verifying : adobe-mappings-cmap-20171205-3.el8.noarch 111/171 Verifying : adobe-mappings-cmap-deprecated-20171205-3.el8.no 112/171 Verifying : adobe-mappings-pdf-20180407-1.el8.noarch 113/171 Verifying : perl-Error-1:0.17025-2.el8.noarch 114/171 Verifying : urw-base35-bookman-fonts-20170801-10.el8.noarch 115/171 Verifying : urw-base35-c059-fonts-20170801-10.el8.noarch 116/171 Verifying : urw-base35-fonts-common-20170801-10.el8.noarch 117/171 Verifying : urw-base35-nimbus-mono-ps-fonts-20170801-10.el8. 
118/171 Verifying : urw-base35-nimbus-roman-fonts-20170801-10.el8.no 119/171 Verifying : urw-base35-standard-symbols-ps-fonts-20170801-10 120/171 Verifying : urw-base35-z003-fonts-20170801-10.el8.noarch 121/171 Verifying : gdk-pixbuf2-modules-2.36.12-5.el8.aarch64 122/171 Verifying : libICE-1.0.9-15.el8.aarch64 123/171 Verifying : libXt-1.1.5-12.el8.aarch64 124/171 Verifying : libxcb-1.13.1-1.el8.aarch64 125/171 Verifying : perl-IO-Socket-SSL-2.066-4.module+el8.3.0+6446+5 126/171 Verifying : perl-Mozilla-CA-20160104-7.module+el8.3.0+6498+9 127/171 Verifying : libXau-1.0.9-3.el8.aarch64 128/171 Verifying : libXext-1.3.4-1.el8.aarch64 129/171 Verifying : libXi-1.7.10-1.el8.aarch64 130/171 Verifying : libXmu-1.1.3-1.el8.aarch64 131/171 Verifying : gd-2.2.5-7.el8.aarch64 132/171 Verifying : libXft-2.3.3-1.el8.aarch64 133/171 Verifying : libXrandr-1.5.2-1.el8.aarch64 134/171 Verifying : gtk2-2.24.32-5.el8.aarch64 135/171 Verifying : jbig2dec-libs-0.16-1.el8.aarch64 136/171 Verifying : libuv-1:1.41.1-1.el8_4.aarch64 137/171 Verifying : pango-1.42.4-8.el8.aarch64 138/171 Verifying : xorg-x11-font-utils-1:7.5-41.el8.aarch64 139/171 Verifying : jasper-libs-2.0.14-5.el8.aarch64 140/171 Verifying : libjpeg-turbo-1.5.3-12.el8.aarch64 141/171 Verifying : perl-Net-SSLeay-1.88-2.module+el8.6.0+13392+f089 142/171 Verifying : cairo-1.15.12-6.el8.aarch64 143/171 Verifying : vim-filesystem-2:8.0.1763-19.el8_6.4.noarch 144/171 Verifying : fribidi-1.0.4-9.el8.aarch64 145/171 Verifying : gtk-update-icon-cache-3.22.30-11.el8.aarch64 146/171 Verifying : openjpeg2-2.4.0-5.el8.aarch64 147/171 Verifying : libXpm-3.5.12-9.el8_7.aarch64 148/171 Verifying : graphviz-2.40.1-44.el8.aarch64 149/171 Verifying : git-2.39.3-1.el8_8.aarch64 150/171 Verifying : git-core-2.39.3-1.el8_8.aarch64 151/171 Verifying : git-core-doc-2.39.3-1.el8_8.noarch 152/171 Verifying : perl-Git-2.39.3-1.el8_8.noarch 153/171 Verifying : python3-rpm-generators-5-8.el8.noarch 154/171 Verifying : 
libtiff-4.0.9-29.el8_8.aarch64 155/171 Verifying : libX11-1.6.8-6.el8.aarch64 156/171 Verifying : libgs-9.27-11.el8.aarch64 157/171 Verifying : librsvg2-2.42.7-5.el8.aarch64 158/171 Verifying : libwebp-1.0.0-9.el8_9.1.aarch64 159/171 Verifying : cmake-3.26.5-1.el8_9.aarch64 160/171 Verifying : cmake-data-3.26.5-1.el8_9.noarch 161/171 Verifying : cmake-filesystem-3.26.5-1.el8_9.aarch64 162/171 Verifying : cmake-rpm-macros-3.26.5-1.el8_9.noarch 163/171 Verifying : libX11-common-1.6.8-6.el8.noarch 164/171 Verifying : pixman-0.38.4-3.el8_9.aarch64 165/171 Verifying : platform-python-devel-3.6.8-56.el8_9.3.aarch64 166/171 Verifying : python36-3.6.8-38.module+el8.9.0+20976+d3c38525. 167/171 Verifying : python36-devel-3.6.8-38.module+el8.9.0+20976+d3c 168/171 Verifying : python36-rpm-macros-3.6.8-38.module+el8.9.0+2097 169/171 Verifying : python3-pip-9.0.3-23.el8_9.1.noarch 170/171 Verifying : doxygen-1:1.8.14-12.el8.aarch64 171/171 Installed products updated. Installed: adobe-mappings-cmap-20171205-3.el8.noarch adobe-mappings-cmap-deprecated-20171205-3.el8.noarch adobe-mappings-pdf-20180407-1.el8.noarch atk-2.28.1-1.el8.aarch64 avahi-libs-0.7-21.el8_9.1.aarch64 cairo-1.15.12-6.el8.aarch64 cmake-3.26.5-1.el8_9.aarch64 cmake-data-3.26.5-1.el8_9.noarch cmake-filesystem-3.26.5-1.el8_9.aarch64 cmake-rpm-macros-3.26.5-1.el8_9.noarch cuda-cccl-12-4-12.4.127-1.aarch64 cuda-crt-12-4-12.4.131-1.aarch64 cuda-cudart-12-4-12.4.127-1.aarch64 cuda-cudart-devel-12-4-12.4.127-1.aarch64 cuda-driver-devel-12-4-12.4.127-1.aarch64 cuda-nvcc-12-4-12.4.131-1.aarch64 cuda-nvml-devel-12-4-12.4.127-1.aarch64 cuda-nvrtc-12-4-12.4.127-1.aarch64 cuda-nvrtc-devel-12-4-12.4.127-1.aarch64 cuda-nvtx-12-4-12.4.127-1.aarch64 cuda-nvvm-12-4-12.4.131-1.aarch64 cuda-toolkit-12-4-config-common-12.4.127-1.noarch cuda-toolkit-12-config-common-12.4.127-1.noarch cuda-toolkit-config-common-12.4.127-1.noarch cups-libs-1:2.2.6-54.el8_9.aarch64 dbus-libs-1:1.12.8-26.el8.aarch64 doxygen-1:1.8.14-12.el8.aarch64 
emacs-filesystem-1:26.1-11.el8.noarch fontconfig-2.13.1-4.el8.aarch64 fontpackages-filesystem-1.44-22.el8.noarch freetype-2.9.1-9.el8.aarch64 fribidi-1.0.4-9.el8.aarch64 gd-2.2.5-7.el8.aarch64 gdk-pixbuf2-2.36.12-5.el8.aarch64 gdk-pixbuf2-modules-2.36.12-5.el8.aarch64 git-2.39.3-1.el8_8.aarch64 git-core-2.39.3-1.el8_8.aarch64 git-core-doc-2.39.3-1.el8_8.noarch google-droid-sans-fonts-20120715-13.el8.noarch graphite2-1.3.10-10.el8.aarch64 graphviz-2.40.1-44.el8.aarch64 groff-base-1.22.3-18.el8.aarch64 gtk-update-icon-cache-3.22.30-11.el8.aarch64 gtk2-2.24.32-5.el8.aarch64 harfbuzz-1.7.5-3.el8.aarch64 hicolor-icon-theme-0.17-2.el8.noarch jasper-libs-2.0.14-5.el8.aarch64 jbig2dec-libs-0.16-1.el8.aarch64 jbigkit-libs-2.1-14.el8.aarch64 lcms2-2.9-2.el8.aarch64 less-530-2.el8_9.aarch64 libICE-1.0.9-15.el8.aarch64 libSM-1.2.3-1.el8.aarch64 libX11-1.6.8-6.el8.aarch64 libX11-common-1.6.8-6.el8.noarch libXau-1.0.9-3.el8.aarch64 libXaw-1.0.13-10.el8.aarch64 libXcomposite-0.4.4-14.el8.aarch64 libXcursor-1.1.15-3.el8.aarch64 libXdamage-1.1.4-14.el8.aarch64 libXext-1.3.4-1.el8.aarch64 libXfixes-5.0.3-7.el8.aarch64 libXft-2.3.3-1.el8.aarch64 libXi-1.7.10-1.el8.aarch64 libXinerama-1.1.4-1.el8.aarch64 libXmu-1.1.3-1.el8.aarch64 libXpm-3.5.12-9.el8_7.aarch64 libXrandr-1.5.2-1.el8.aarch64 libXrender-0.9.10-7.el8.aarch64 libXt-1.1.5-12.el8.aarch64 libXxf86misc-1.0.4-1.el8.aarch64 libXxf86vm-1.1.4-9.el8.aarch64 libcroco-0.6.12-4.el8_2.1.aarch64 libcublas-12-4-12.4.5.8-1.aarch64 libcublas-devel-12-4-12.4.5.8-1.aarch64 libcudnn8-8.9.7.29-2.cuda12.3.aarch64 libcudnn8-devel-8.9.7.29-2.cuda12.3.aarch64 libcurand-12-4-10.3.5.147-1.aarch64 libcurand-devel-12-4-10.3.5.147-1.aarch64 libdatrie-0.2.9-7.el8.aarch64 libedit-3.1-23.20170329cvs.el8.aarch64 libfontenc-1.1.3-8.el8.aarch64 libgs-9.27-11.el8.aarch64 libidn-1.34-5.el8.aarch64 libijs-0.35-5.el8.aarch64 libjpeg-turbo-1.5.3-12.el8.aarch64 libmcpp-2.7.2-20.el8.aarch64 libpaper-1.1.24-22.el8.aarch64 libpng-2:1.6.34-5.el8.aarch64 
librsvg2-2.42.7-5.el8.aarch64 libthai-0.1.27-2.el8.aarch64 libtiff-4.0.9-29.el8_8.aarch64 libuv-1:1.41.1-1.el8_4.aarch64 libwebp-1.0.0-9.el8_9.1.aarch64 libxcb-1.13.1-1.el8.aarch64 mcpp-2.7.2-20.el8.aarch64 openjpeg2-2.4.0-5.el8.aarch64 openssh-8.0p1-19.el8_9.2.aarch64 openssh-clients-8.0p1-19.el8_9.2.aarch64 openssl-1:1.1.1k-12.el8_9.aarch64 pango-1.42.4-8.el8.aarch64 perl-Carp-1.42-396.el8.noarch perl-Data-Dumper-2.167-399.el8.aarch64 perl-Digest-1.17-395.el8.noarch perl-Digest-MD5-2.55-396.el8.aarch64 perl-Encode-4:2.97-3.el8.aarch64 perl-Errno-1.28-422.el8.aarch64 perl-Error-1:0.17025-2.el8.noarch perl-Exporter-5.72-396.el8.noarch perl-File-Path-2.15-2.el8.noarch perl-File-Temp-0.230.600-1.el8.noarch perl-Getopt-Long-1:2.50-4.el8.noarch perl-Git-2.39.3-1.el8_8.noarch perl-HTTP-Tiny-0.074-2.el8_9.1.noarch perl-IO-1.38-422.el8.aarch64 perl-IO-Socket-IP-0.39-5.el8.noarch perl-IO-Socket-SSL-2.066-4.module+el8.3.0+6446+594cad75.noarch perl-MIME-Base64-3.15-396.el8.aarch64 perl-Mozilla-CA-20160104-7.module+el8.3.0+6498+9eecfe51.noarch perl-Net-SSLeay-1.88-2.module+el8.6.0+13392+f0897f98.aarch64 perl-PathTools-3.74-1.el8.aarch64 perl-Pod-Escapes-1:1.07-395.el8.noarch perl-Pod-Perldoc-3.28-396.el8.noarch perl-Pod-Simple-1:3.35-395.el8.noarch perl-Pod-Usage-4:1.69-395.el8.noarch perl-Scalar-List-Utils-3:1.49-2.el8.aarch64 perl-Socket-4:2.027-3.el8.aarch64 perl-Storable-1:3.11-3.el8.aarch64 perl-Term-ANSIColor-4.06-396.el8.noarch perl-Term-Cap-1.17-395.el8.noarch perl-TermReadKey-2.37-7.el8.aarch64 perl-Text-ParseWords-3.30-395.el8.noarch perl-Text-Tabs+Wrap-2013.0523-395.el8.noarch perl-Time-Local-1:1.280-1.el8.noarch perl-URI-1.73-3.el8.noarch perl-Unicode-Normalize-1.25-396.el8.aarch64 perl-constant-1.33-396.el8.noarch perl-interpreter-4:5.26.3-422.el8.aarch64 perl-libnet-3.11-3.el8.noarch perl-libs-4:5.26.3-422.el8.aarch64 perl-macros-4:5.26.3-422.el8.aarch64 perl-parent-1:0.237-1.el8.noarch perl-podlators-4.11-1.el8.noarch perl-threads-1:2.21-2.el8.aarch64 
perl-threads-shared-1.58-2.el8.aarch64 pixman-0.38.4-3.el8_9.aarch64 platform-python-devel-3.6.8-56.el8_9.3.aarch64 platform-python-pip-9.0.3-23.el8_9.1.noarch python3-pip-9.0.3-23.el8_9.1.noarch python3-rpm-generators-5-8.el8.noarch python3-setuptools-39.2.0-7.el8.noarch python36-3.6.8-38.module+el8.9.0+20976+d3c38525.aarch64 python36-devel-3.6.8-38.module+el8.9.0+20976+d3c38525.aarch64 python36-rpm-macros-3.6.8-38.module+el8.9.0+20976+d3c38525.noarch shared-mime-info-1.9-3.el8.aarch64 urw-base35-bookman-fonts-20170801-10.el8.noarch urw-base35-c059-fonts-20170801-10.el8.noarch urw-base35-d050000l-fonts-20170801-10.el8.noarch urw-base35-fonts-20170801-10.el8.noarch urw-base35-fonts-common-20170801-10.el8.noarch urw-base35-gothic-fonts-20170801-10.el8.noarch urw-base35-nimbus-mono-ps-fonts-20170801-10.el8.noarch urw-base35-nimbus-roman-fonts-20170801-10.el8.noarch urw-base35-nimbus-sans-fonts-20170801-10.el8.noarch urw-base35-p052-fonts-20170801-10.el8.noarch urw-base35-standard-symbols-ps-fonts-20170801-10.el8.noarch urw-base35-z003-fonts-20170801-10.el8.noarch vim-filesystem-2:8.0.1763-19.el8_6.4.noarch xorg-x11-font-utils-1:7.5-41.el8.aarch64 xorg-x11-fonts-ISO8859-1-100dpi-7.5-19.el8.noarch xorg-x11-server-utils-7.7-27.el8.aarch64 Complete! Finish: build setup for cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm Start: rpmbuild cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm sh: -c: line 0: unexpected EOF while looking for matching `"' sh: -c: line 1: syntax error: unexpected end of file Building target platforms: aarch64 Building for target aarch64 Executing(%prep): /bin/sh -e /var/tmp/rpm-tmp.dLZv9O + umask 022 + cd /builddir/build/BUILD + cd /builddir/build/BUILD + rm -rf cutlass + /usr/bin/mkdir -p cutlass + cd cutlass + /usr/bin/chmod -Rf a+rX,u+w,g-w,o-w . + git clone --depth 1 -n -b v3.5.0 https://github.com/NVIDIA/cutlass.git . Cloning into '.'... 
+ git reset --hard v3.5.0 HEAD is now at 7d49e6c Updates for CUTLASS 3.5.0 (#1468) + git log --format=fuller commit 7d49e6c7e2f8896c47f586706e67e1fb215529dc Author: Vijay Thakkar AuthorDate: Thu Apr 11 21:33:40 2024 -0400 Commit: GitHub CommitDate: Thu Apr 11 21:33:40 2024 -0400 Updates for CUTLASS 3.5.0 (#1468) Patch #0 (cutlass-fp16.patch): + echo 'Patch #0 (cutlass-fp16.patch):' + /usr/bin/patch --no-backup-if-mismatch -p0 -b --suffix .fp16~ --fuzz=100 patching file include/cutlass/functional.h Hunk #1 succeeded at 217 with fuzz 3 (offset 128 lines). + sed -i /-rpath/d CMakeLists.txt + exit 0 Executing(%build): /bin/sh -e /var/tmp/rpm-tmp.gV8c3Z + umask 022 + cd /builddir/build/BUILD + cd cutlass + mkdir -p build + pushd build ~/build/BUILD/cutlass/build ~/build/BUILD/cutlass + export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64/ + LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64/ + CFLAGS= + export CFLAGS + CXXFLAGS= + export CXXFLAGS + FFLAGS=' -I/usr/lib64/gfortran/modules' + export FFLAGS + FCFLAGS=' -I/usr/lib64/gfortran/modules' + export FCFLAGS + LDFLAGS='-Wl,-z,relro ' + export LDFLAGS + /usr/bin/cmake -DCMAKE_C_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_CXX_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_Fortran_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DCMAKE_INSTALL_PREFIX:PATH=/usr -DINCLUDE_INSTALL_DIR:PATH=/usr/include -DLIB_INSTALL_DIR:PATH=/usr/lib64 -DSYSCONF_INSTALL_DIR:PATH=/etc -DSHARE_INSTALL_PREFIX:PATH=/usr/share -DLIB_SUFFIX=64 -DBUILD_SHARED_LIBS:BOOL=ON .. 
-DCMAKE_SKIP_RPATH=ON -DCMAKE_VERBOSE_MAKEFILE=OFF -DCMAKE_BUILD_TYPE=Release -DCMAKE_EXE_LINKER_FLAGS=/usr/lib64/libstdc++.so.6 -DBUILD_TESTING=OFF -DCUTLASS_ENABLE_TESTS=OFF -DCUTLASS_ENABLE_PROFILER=ON -DCUTLASS_ENABLE_EXAMPLES=OFF -DCUDA_PROPAGATE_HOST_FLAGS=OFF -DCUTLASS_NVCC_EMBED_PTX=ON -DCUTLASS_NVCC_EMBED_CUBIN=ON '-DCUTLASS_NVCC_ARCHS=52;61;75;86;89;90' '-DCMAKE_CUDA_FLAGS=-Wl,--no-relax -Xfatbin=-compress-all --compiler-options -fPIC -Wno-deprecated-gpu-targets -allow-unsupported-compiler -D_SERIALIZE_H_INCLUDED' -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.4/bin/nvcc -- CMake Version: 3.26.5 -- CUTLASS 3.5.0 -- The CXX compiler identification is GNU 8.5.0 -- Detecting CXX compiler ABI info -- Detecting CXX compiler ABI info - done -- Check for working CXX compiler: /usr/bin/c++ - skipped -- Detecting CXX compile features -- Detecting CXX compile features - done -- The CUDA compiler identification is NVIDIA 12.4.131 -- Detecting CUDA compiler ABI info -- Detecting CUDA compiler ABI info - done -- Check for working CUDA compiler: /usr/local/cuda-12.4/bin/nvcc - skipped -- Detecting CUDA compile features -- Detecting CUDA compile features - done -- CUDART: /usr/local/cuda-12.4/lib64/libcudart.so -- CUDA Driver: /usr/local/cuda-12.4/lib64/stubs/libcuda.so -- NVRTC: /usr/local/cuda-12.4/lib64/libnvrtc.so -- Default Install Location: /usr -- Found Python3: /usr/bin/python3.6 (found suitable version "3.6.8", minimum required is "3.5") found components: Interpreter CMake Warning at CMakeLists.txt:156 (message): Using unsupported or deprecated compute capabilities 52;61. Support may be removed in future versions. 
-- CUDA Compilation Architectures: 52;61;75;86;89;90 -- Enable caching of reference results in conv unit tests -- Enable rigorous conv problem sizes in conv unit tests -- Using NVCC flags: --expt-relaxed-constexpr;-DCUTLASS_TEST_LEVEL=0;-DCUTLASS_TEST_ENABLE_CACHED_RESULTS=1;-DCUTLASS_CONV_UNIT_TEST_RIGOROUS_SIZE_ENABLED=1;-DCUTLASS_DEBUG_TRACE_LEVEL=0;-Xcompiler=-Wconversion;-Xcompiler=-fno-strict-aliasing -- CUTLASS Revision: 7d49e6c -- Configuring cublas ... -- cuBLAS Disabled. -- Configuring cuBLAS ... done. -- Completed generation of library instances. See /builddir/build/BUILD/cutlass/build/tools/library/library_instance_generation.log for more information. -- Configuring done (3.7s) -- Generating done (1.3s) CMake Warning: Manually-specified variables were not used by the project: CMAKE_C_FLAGS_RELEASE CMAKE_Fortran_FLAGS_RELEASE CUDA_PROPAGATE_HOST_FLAGS INCLUDE_INSTALL_DIR LIB_INSTALL_DIR LIB_SUFFIX SHARE_INSTALL_PREFIX SYSCONF_INSTALL_DIR -- Build files have been written to: /builddir/build/BUILD/cutlass/build + make -j4 [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_dgemm_objs.dir/generated/gemm/50/dgemm/all_sm50_dgemm_gemm_operations.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684symm_objs.dir/generated/symm/90/z1684symm/all_sm90_z1684symm_symm_operations.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_cgemm_objs.dir/generated/gemm/50/cgemm/all_sm50_cgemm_gemm_operations.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/handle.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_cgemm_objs.dir/generated/gemm/50/cgemm/cutlass_simt_cgemm_128x64_8x2_nn_align1.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684symm_objs.dir/generated/symm/90/z1684symm/cutlass_tensorop_z1684symm_128x64x8_1x1x1_3_n_ls_l_align1.cu.o [ 0%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm50_dgemm_objs.dir/generated/gemm/50/dgemm/cutlass_simt_dgemm_128x128_8x2_nn_align1.cu.o [ 0%] Building CXX object tools/library/CMakeFiles/cutlass_library_objs.dir/src/manifest.cpp.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/operation_table.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/singleton.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/util.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684symm_objs.dir/generated/symm/90/z1684symm/cutlass_tensorop_z1684symm_128x64x8_1x1x1_3_n_ls_u_align1.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_dgemm_objs.dir/generated/gemm/50/dgemm/cutlass_simt_dgemm_128x128_8x2_nt_align1.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_cgemm_objs.dir/generated/gemm/50/cgemm/cutlass_simt_cgemm_128x64_8x2_nt_align1.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_int4.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684symm_objs.dir/generated/symm/90/z1684symm/cutlass_tensorop_z1684symm_128x64x8_1x1x1_3_n_rs_l_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_dgemm_objs.dir/generated/gemm/50/dgemm/cutlass_simt_dgemm_128x128_8x2_tn_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_cgemm_objs.dir/generated/gemm/50/cgemm/cutlass_simt_cgemm_128x64_8x2_tn_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684symm_objs.dir/generated/symm/90/z1684symm/cutlass_tensorop_z1684symm_128x64x8_1x1x1_3_n_rs_u_align1.cu.o [ 1%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm50_dgemm_objs.dir/generated/gemm/50/dgemm/cutlass_simt_dgemm_128x128_8x2_tt_align1.cu.o [ 1%] Built target cutlass_library_symm_sm90_z1684symm_objs [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/all_sm50_sgemm_gemm_operations.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_cgemm_objs.dir/generated/gemm/50/cgemm/cutlass_simt_cgemm_128x64_8x2_tt_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x128_8x2_nn_align1.cu.o [ 1%] Built target cutlass_library_gemm_sm50_dgemm_objs [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm60_hgemm_objs.dir/generated/gemm/60/hgemm/all_sm60_hgemm_gemm_operations.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm60_hgemm_objs.dir/generated/gemm/60/hgemm/cutlass_simt_hgemm_256x128_8x2_nn_align1.cu.o [ 1%] Built target cutlass_library_gemm_sm50_cgemm_objs [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_igemm_s8_objs.dir/generated/gemm/61/igemm_s8/all_sm61_igemm_s8_gemm_operations.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x128_8x2_nt_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_igemm_s8_objs.dir/generated/gemm/61/igemm_s8/cutlass_simt_igemm_s8_128x128_32x2_nn_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_int8_canonical.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm60_hgemm_objs.dir/generated/gemm/60/hgemm/cutlass_simt_hgemm_256x128_8x2_nt_align1.cu.o [ 1%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x128_8x2_tn_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_igemm_s8_objs.dir/generated/gemm/61/igemm_s8/cutlass_simt_igemm_s8_128x128_32x2_nt_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x128_8x2_tt_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm60_hgemm_objs.dir/generated/gemm/60/hgemm/cutlass_simt_hgemm_256x128_8x2_tn_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_igemm_s8_objs.dir/generated/gemm/61/igemm_s8/cutlass_simt_igemm_s8_128x128_32x2_tn_align1.cu.o [ 1%] Built target cutlass_library_gemm_sm50_sgemm_objs [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_s8_igemm_s8_objs.dir/generated/gemm/61/s8_igemm_s8/all_sm61_s8_igemm_s8_gemm_operations.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_igemm_s8_objs.dir/generated/gemm/61/igemm_s8/cutlass_simt_igemm_s8_128x128_32x2_tt_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_s8_igemm_s8_objs.dir/generated/gemm/61/s8_igemm_s8/cutlass_simt_s8_igemm_s8_128x128_32x2_nn_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm60_hgemm_objs.dir/generated/gemm/60/hgemm/cutlass_simt_hgemm_256x128_8x2_tt_align1.cu.o [ 1%] Built target cutlass_library_gemm_sm61_igemm_s8_objs [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_f16_objs.dir/generated/gemm/70/f16_s884gemm_f16/all_sm70_f16_s884gemm_f16_gemm_operations.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_s8_igemm_s8_objs.dir/generated/gemm/61/s8_igemm_s8/cutlass_simt_s8_igemm_s8_128x128_32x2_nt_align1.cu.o [ 1%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_f16_objs.dir/generated/gemm/70/f16_s884gemm_f16/cutlass_tensorop_f16_s884gemm_f16_256x128_32x2_nn_align8.cu.o [ 1%] Built target cutlass_library_gemm_sm60_hgemm_objs [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/all_sm70_f16_s884gemm_planar_complex_array_f16_gemm_operations.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_nn_align8.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_s8_igemm_s8_objs.dir/generated/gemm/61/s8_igemm_s8/cutlass_simt_s8_igemm_s8_128x128_32x2_tn_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_f16_objs.dir/generated/gemm/70/f16_s884gemm_f16/cutlass_tensorop_f16_s884gemm_f16_256x128_32x2_nt_align8.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_cn_align8.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_s8_igemm_s8_objs.dir/generated/gemm/61/s8_igemm_s8/cutlass_simt_s8_igemm_s8_128x128_32x2_tt_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_f16_objs.dir/generated/gemm/70/f16_s884gemm_f16/cutlass_tensorop_f16_s884gemm_f16_256x128_32x2_tn_align8.cu.o [ 1%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_nc_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_f16_objs.dir/generated/gemm/70/f16_s884gemm_f16/cutlass_tensorop_f16_s884gemm_f16_256x128_32x2_tt_align8.cu.o [ 2%] Built target cutlass_library_gemm_sm61_s8_igemm_s8_objs [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/all_sm70_f16_s884gemm_planar_complex_f16_gemm_operations.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_nn_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_cc_align8.cu.o [ 2%] Built target cutlass_library_gemm_sm70_f16_s884gemm_f16_objs [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_objs.dir/generated/gemm/70/h884gemm/all_sm70_h884gemm_gemm_operations.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_objs.dir/generated/gemm/70/h884gemm/cutlass_tensorop_h884gemm_256x128_32x2_nn_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_cn_align8.cu.o [ 2%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_nt_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_objs.dir/generated/gemm/70/h884gemm/cutlass_tensorop_h884gemm_256x128_32x2_nt_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_nc_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_ct_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_objs.dir/generated/gemm/70/h884gemm/cutlass_tensorop_h884gemm_256x128_32x2_tn_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_cc_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_nh_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_objs.dir/generated/gemm/70/h884gemm/cutlass_tensorop_h884gemm_256x128_32x2_tt_align8.cu.o [ 2%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_nt_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_int8_interleaved_32.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_ch_align8.cu.o [ 2%] Built target cutlass_library_gemm_sm70_h884gemm_objs [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/all_sm70_h884gemm_planar_complex_gemm_operations.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_ct_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_nn_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_tn_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_nh_align8.cu.o [ 2%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_cn_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_int8_interleaved_64.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_hn_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_ch_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_nc_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_tc_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_tn_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_cc_align8.cu.o [ 2%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_hn_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_hc_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_nt_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_tc_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_tt_align8.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_ct_align8.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_hc_align8.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_ht_align8.cu.o [ 3%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_nh_align8.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_e4m3a_e4m3out.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_tt_align8.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_th_align8.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_ch_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_ht_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_tn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_hh_align8.cu.o [ 4%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_th_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_hn_align8.cu.o [ 4%] Built target cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/all_sm70_h884gemm_planar_complex_array_gemm_operations.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_nn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_hh_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_tc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_cn_align8.cu.o [ 4%] Built target cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_f16_objs.dir/generated/gemm/70/s884gemm_f16/all_sm70_s884gemm_f16_gemm_operations.cu.o [ 4%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_hc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_f16_objs.dir/generated/gemm/70/s884gemm_f16/cutlass_tensorop_s884gemm_f16_256x128_32x2_nn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_nc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_tt_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_f16_objs.dir/generated/gemm/70/s884gemm_f16/cutlass_tensorop_s884gemm_f16_256x128_32x2_nt_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_ht_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_cc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_f16_objs.dir/generated/gemm/70/s884gemm_f16/cutlass_tensorop_s884gemm_f16_256x128_32x2_tn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_th_align8.cu.o [ 4%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_nt_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_f16_objs.dir/generated/gemm/70/s884gemm_f16/cutlass_tensorop_s884gemm_f16_256x128_32x2_tt_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_hh_align8.cu.o [ 4%] Built target cutlass_library_gemm_sm70_s884gemm_f16_objs [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/all_sm70_s884gemm_planar_complex_array_f16_gemm_operations.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_ct_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_nn_align8.cu.o [ 4%] Built target cutlass_library_gemm_sm70_h884gemm_planar_complex_objs [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/all_sm70_s884gemm_planar_complex_f16_gemm_operations.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_nn_align8.cu.o [ 4%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_nh_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_cn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_cn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_ch_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_nc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_nc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_cc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_tn_align8.cu.o [ 4%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_cc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_nt_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_hn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_nt_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_ct_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_tc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_ct_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_nh_align8.cu.o [ 4%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_nh_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_hc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_ch_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_ch_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_tt_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_tn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_tn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_ht_align8.cu.o [ 4%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_hn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_hn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_th_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_tc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_tc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_hh_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_hc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_hc_align8.cu.o [ 4%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_tt_align8.cu.o [ 4%] Built target cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_f16_objs.dir/generated/gemm/75/f16_s1688gemm_f16/all_sm75_f16_s1688gemm_f16_gemm_operations.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_tt_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_f16_objs.dir/generated/gemm/75/f16_s1688gemm_f16/cutlass_tensorop_f16_s1688gemm_f16_256x128_32x2_nn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_ht_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_ht_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_f16_objs.dir/generated/gemm/75/f16_s1688gemm_f16/cutlass_tensorop_f16_s1688gemm_f16_256x128_32x2_nt_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_th_align8.cu.o [ 5%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_th_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_f16_objs.dir/generated/gemm/75/f16_s1688gemm_f16/cutlass_tensorop_f16_s1688gemm_f16_256x128_32x2_tn_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_hh_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_hh_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_f16_objs.dir/generated/gemm/75/f16_s1688gemm_f16/cutlass_tensorop_f16_s1688gemm_f16_256x128_32x2_tt_align8.cu.o [ 5%] Built target cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/all_sm75_f16_s1688gemm_planar_complex_array_f16_gemm_operations.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_nn_align8.cu.o [ 5%] Built target cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs [ 5%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/all_sm75_f16_s1688gemm_planar_complex_f16_gemm_operations.cu.o [ 5%] Built target cutlass_library_gemm_sm75_f16_s1688gemm_f16_objs [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_objs.dir/generated/gemm/75/h1688gemm/all_sm75_h1688gemm_gemm_operations.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_nn_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_objs.dir/generated/gemm/75/h1688gemm/cutlass_tensorop_h1688gemm_256x128_32x2_nn_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_cn_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_cn_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_objs.dir/generated/gemm/75/h1688gemm/cutlass_tensorop_h1688gemm_256x128_32x2_nt_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_nc_align8.cu.o [ 5%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_nc_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_objs.dir/generated/gemm/75/h1688gemm/cutlass_tensorop_h1688gemm_256x128_32x2_tn_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_cc_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_cc_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_objs.dir/generated/gemm/75/h1688gemm/cutlass_tensorop_h1688gemm_256x128_32x2_tt_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_nt_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_nt_align8.cu.o [ 6%] Built target cutlass_library_gemm_sm75_h1688gemm_objs [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/all_sm75_h1688gemm_planar_complex_gemm_operations.cu.o [ 6%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_nn_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_ct_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_ct_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_cn_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_nh_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_nh_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_nc_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_ch_align8.cu.o [ 7%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_ch_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_cc_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_tn_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_tn_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_nt_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_hn_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_hn_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_ct_align8.cu.o [ 7%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_tc_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_tc_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_nh_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_hc_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_hc_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_ch_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_tt_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_tt_align8.cu.o [ 7%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_tn_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_ht_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_hn_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_ht_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_th_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_tc_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_th_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_hh_align8.cu.o [ 7%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_e5m2a_e4m3out.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_hc_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_hh_align8.cu.o [ 7%] Built target cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/all_sm75_h1688gemm_planar_complex_array_gemm_operations.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_nn_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_tt_align8.cu.o [ 7%] Built target cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i88128xorgemm_b1_objs.dir/generated/gemm/75/i88128xorgemm_b1/all_sm75_i88128xorgemm_b1_gemm_operations.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i88128xorgemm_b1_objs.dir/generated/gemm/75/i88128xorgemm_b1/cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128.cu.o [ 7%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_cn_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_ht_align8.cu.o [ 7%] Built target cutlass_library_gemm_sm75_i88128xorgemm_b1_objs [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i8816gemm_s8_objs.dir/generated/gemm/75/i8816gemm_s8/all_sm75_i8816gemm_s8_gemm_operations.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i8816gemm_s8_objs.dir/generated/gemm/75/i8816gemm_s8/cutlass_tensorop_i8816gemm_s8_256x128_64x2_tn_align16.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_th_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_nc_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_hh_align8.cu.o [ 8%] Built target cutlass_library_gemm_sm75_i8816gemm_s8_objs [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i8816gemm_u8_objs.dir/generated/gemm/75/i8816gemm_u8/all_sm75_i8816gemm_u8_gemm_operations.cu.o [ 8%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_cc_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i8816gemm_u8_objs.dir/generated/gemm/75/i8816gemm_u8/cutlass_tensorop_i8816gemm_u8_256x128_64x2_tn_align16.cu.o [ 8%] Built target cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i8832gemm_s4_objs.dir/generated/gemm/75/i8832gemm_s4/all_sm75_i8832gemm_s4_gemm_operations.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i8832gemm_s4_objs.dir/generated/gemm/75/i8832gemm_s4/cutlass_tensorop_i8832gemm_s4_256x128_128x2_tn_align32.cu.o [ 8%] Built target cutlass_library_gemm_sm75_i8816gemm_u8_objs [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i8832gemm_u4_objs.dir/generated/gemm/75/i8832gemm_u4/all_sm75_i8832gemm_u4_gemm_operations.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_nt_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i8832gemm_u4_objs.dir/generated/gemm/75/i8832gemm_u4/cutlass_tensorop_i8832gemm_u4_256x128_128x2_tn_align32.cu.o [ 8%] Built target cutlass_library_gemm_sm75_i8832gemm_s4_objs [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_f16_objs.dir/generated/gemm/75/s1688gemm_f16/all_sm75_s1688gemm_f16_gemm_operations.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_f16_objs.dir/generated/gemm/75/s1688gemm_f16/cutlass_tensorop_s1688gemm_f16_256x128_32x2_nn_align8.cu.o [ 8%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_ct_align8.cu.o [ 8%] Built target cutlass_library_gemm_sm75_i8832gemm_u4_objs [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/all_sm75_s1688gemm_planar_complex_array_f16_gemm_operations.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_nn_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_f16_objs.dir/generated/gemm/75/s1688gemm_f16/cutlass_tensorop_s1688gemm_f16_256x128_32x2_nt_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_nh_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_cn_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_f16_objs.dir/generated/gemm/75/s1688gemm_f16/cutlass_tensorop_s1688gemm_f16_256x128_32x2_tn_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_nc_align8.cu.o [ 8%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_ch_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_f16_objs.dir/generated/gemm/75/s1688gemm_f16/cutlass_tensorop_s1688gemm_f16_256x128_32x2_tt_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_cc_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_tn_align8.cu.o [ 8%] Built target cutlass_library_gemm_sm75_s1688gemm_f16_objs [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_hn_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_nt_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_tc_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_hc_align8.cu.o [ 9%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_ct_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_tt_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_ht_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_nh_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_th_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_ch_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_hh_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_tn_align8.cu.o [ 9%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_hn_align8.cu.o [ 9%] Built target cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/all_sm75_s1688gemm_planar_complex_f16_gemm_operations.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_nn_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_tc_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_hc_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_cn_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_tt_align8.cu.o [ 9%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_ht_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_nc_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_th_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_hh_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_cc_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_nt_align8.cu.o [ 9%] Built target cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s4_i8832gemm_s4_objs.dir/generated/gemm/75/s4_i8832gemm_s4/all_sm75_s4_i8832gemm_s4_gemm_operations.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s4_i8832gemm_s4_objs.dir/generated/gemm/75/s4_i8832gemm_s4/cutlass_tensorop_s4_i8832gemm_s4_256x128_128x2_tn_align32.cu.o [ 9%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_ct_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_nh_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s4_i8832gemm_s4_objs.dir/generated/gemm/75/s4_i8832gemm_s4/cutlass_tensorop_s4_i8832gemm_s4_256x128_128x2_n64t64_align32.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_ch_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_tn_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_hn_align8.cu.o [ 9%] Built target cutlass_library_gemm_sm75_s4_i8832gemm_s4_objs [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s8_i8816gemm_s8_objs.dir/generated/gemm/75/s8_i8816gemm_s8/all_sm75_s8_i8816gemm_s8_gemm_operations.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s8_i8816gemm_s8_objs.dir/generated/gemm/75/s8_i8816gemm_s8/cutlass_tensorop_s8_i8816gemm_s8_256x128_64x2_tn_align16.cu.o [ 9%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_tc_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_hc_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s8_i8816gemm_s8_objs.dir/generated/gemm/75/s8_i8816gemm_s8/cutlass_tensorop_s8_i8816gemm_s8_256x128_64x2_n32t32_align16.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_tt_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_ht_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_th_align8.cu.o [ 9%] Built target cutlass_library_gemm_sm75_s8_i8816gemm_s8_objs [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_u4_i8832gemm_u4_objs.dir/generated/gemm/75/u4_i8832gemm_u4/all_sm75_u4_i8832gemm_u4_gemm_operations.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_u4_i8832gemm_u4_objs.dir/generated/gemm/75/u4_i8832gemm_u4/cutlass_tensorop_u4_i8832gemm_u4_256x128_128x2_tn_align32.cu.o [ 9%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_hh_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_u4_i8832gemm_u4_objs.dir/generated/gemm/75/u4_i8832gemm_u4/cutlass_tensorop_u4_i8832gemm_u4_256x128_128x2_n64t64_align32.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_u8_i8816gemm_u8_objs.dir/generated/gemm/75/u8_i8816gemm_u8/all_sm75_u8_i8816gemm_u8_gemm_operations.cu.o [ 9%] Built target cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16/all_sm80_bf16_s16816gemm_bf16_gemm_operations.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_u8_i8816gemm_u8_objs.dir/generated/gemm/75/u8_i8816gemm_u8/cutlass_tensorop_u8_i8816gemm_u8_256x128_64x2_tn_align16.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16/cutlass_tensorop_bf16_s16816gemm_bf16_256x128_32x3_nn_align8.cu.o [ 9%] Built target cutlass_library_gemm_sm75_u4_i8832gemm_u4_objs [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_s8_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16_s8/all_sm80_bf16_s16816gemm_bf16_s8_gemm_operations.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_s8_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16_s8/cutlass_tensorop_bf16_s16816gemm_bf16_s8_128x128_64x4_tn_align16.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_u8_i8816gemm_u8_objs.dir/generated/gemm/75/u8_i8816gemm_u8/cutlass_tensorop_u8_i8816gemm_u8_256x128_64x2_n32t32_align16.cu.o [ 9%] Building 
CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16/cutlass_tensorop_bf16_s16816gemm_bf16_256x128_32x3_nt_align8.cu.o [ 9%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_s8_objs [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_u8_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16_u8/all_sm80_bf16_s16816gemm_bf16_u8_gemm_operations.cu.o [ 9%] Built target cutlass_library_gemm_sm75_u8_i8816gemm_u8_objs [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_u8_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16_u8/cutlass_tensorop_bf16_s16816gemm_bf16_u8_128x128_64x4_tn_align16.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/all_sm80_bf16_s16816gemm_planar_complex_array_bf16_gemm_operations.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16/cutlass_tensorop_bf16_s16816gemm_bf16_256x128_32x3_tn_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_nn_align8.cu.o [ 9%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_u8_objs [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/all_sm80_bf16_s16816gemm_planar_complex_bf16_gemm_operations.cu.o [ 9%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16/cutlass_tensorop_bf16_s16816gemm_bf16_256x128_32x3_tt_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_cn_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_nn_align8.cu.o [ 9%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_objs [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_s8_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_s8_bf16/all_sm80_bf16_s16816gemm_s8_bf16_gemm_operations.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_nc_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_cn_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_s8_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_s8_bf16/cutlass_tensorop_bf16_s16816gemm_s8_bf16_128x128_64x4_tn_align16.cu.o [ 10%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_cc_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_nc_align8.cu.o [ 10%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_s8_bf16_objs [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_u8_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_u8_bf16/all_sm80_bf16_s16816gemm_u8_bf16_gemm_operations.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_e4m3a_e5m2out.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_u8_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_u8_bf16/cutlass_tensorop_bf16_s16816gemm_u8_bf16_128x128_64x4_tn_align16.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_cc_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_nt_align8.cu.o [ 10%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_u8_bf16_objs [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16832spgemm_bf16_objs.dir/generated/gemm/80/bf16_s16832spgemm_bf16/all_sm80_bf16_s16832spgemm_bf16_gemm_operations.cu.o [ 10%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_nt_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_ct_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16832spgemm_bf16_objs.dir/generated/gemm/80/bf16_s16832spgemm_bf16/cutlass_tensorop_bf16_s16832spgemm_bf16_64x128_64x6_nn_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_ct_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_nh_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16832spgemm_bf16_objs.dir/generated/gemm/80/bf16_s16832spgemm_bf16/cutlass_tensorop_bf16_s16832spgemm_bf16_64x128_64x6_nt_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_nh_align8.cu.o [ 10%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_ch_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16832spgemm_bf16_objs.dir/generated/gemm/80/bf16_s16832spgemm_bf16/cutlass_tensorop_bf16_s16832spgemm_bf16_64x128_64x6_tn_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_ch_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_tn_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16832spgemm_bf16_objs.dir/generated/gemm/80/bf16_s16832spgemm_bf16/cutlass_tensorop_bf16_s16832spgemm_bf16_64x128_64x6_tt_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_tn_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_hn_align8.cu.o [ 10%] Built target cutlass_library_gemm_sm80_bf16_s16832spgemm_bf16_objs [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/all_sm80_c1688gemm_gemm_operations.cu.o [ 
10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_nn_align1.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_hn_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_tc_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_tc_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_cn_align1.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_hc_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_hc_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_nc_align1.cu.o [ 10%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_tt_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_tt_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_cc_align1.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_ht_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_ht_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_nt_align1.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_th_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_th_align8.cu.o [ 10%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_ct_align1.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_hh_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_hh_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_nh_align1.cu.o [ 10%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/all_sm80_c1688tf32gemm_gemm_operations.cu.o [ 10%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/all_sm80_cgemm_gemm_operations.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_nn_align1.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_nn_align1.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_ch_align1.cu.o [ 10%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_cn_align1.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_cn_align1.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_tn_align1.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_nc_align1.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_hn_align1.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_nc_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_cc_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_tc_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_cc_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_nt_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_hc_align1.cu.o [ 11%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_ct_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_nt_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_tt_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_nh_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_ct_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_ht_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_ch_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_nh_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_th_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_tn_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_hh_align1.cu.o [ 11%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_ch_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_hn_align1.cu.o [ 11%] Built target cutlass_library_gemm_sm80_c1688gemm_objs [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_d884gemm_objs.dir/generated/gemm/80/d884gemm/all_sm80_d884gemm_gemm_operations.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_tn_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_d884gemm_objs.dir/generated/gemm/80/d884gemm/cutlass_tensorop_d884gemm_128x128_16x3_nn_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_tc_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_d884gemm_objs.dir/generated/gemm/80/d884gemm/cutlass_tensorop_d884gemm_128x128_16x3_nt_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_hn_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_hc_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_d884gemm_objs.dir/generated/gemm/80/d884gemm/cutlass_tensorop_d884gemm_128x128_16x3_tn_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_tt_align1.cu.o [ 
12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_tc_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_d884gemm_objs.dir/generated/gemm/80/d884gemm/cutlass_tensorop_d884gemm_128x128_16x3_tt_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_ht_align1.cu.o [ 12%] Built target cutlass_library_gemm_sm80_d884gemm_objs [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_dgemm_objs.dir/generated/gemm/80/dgemm/all_sm80_dgemm_gemm_operations.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_hc_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_dgemm_objs.dir/generated/gemm/80/dgemm/cutlass_simt_dgemm_128x128_8x3_nn_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_th_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_tt_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_dgemm_objs.dir/generated/gemm/80/dgemm/cutlass_simt_dgemm_128x128_8x3_nt_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_hh_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_dgemm_objs.dir/generated/gemm/80/dgemm/cutlass_simt_dgemm_128x128_8x3_tn_align1.cu.o [ 12%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_ht_align1.cu.o [ 12%] Built target cutlass_library_gemm_sm80_c1688tf32gemm_objs [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_objs.dir/generated/gemm/80/f16_s16816gemm_f16/all_sm80_f16_s16816gemm_f16_gemm_operations.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_objs.dir/generated/gemm/80/f16_s16816gemm_f16/cutlass_tensorop_f16_s16816gemm_f16_256x128_32x3_nn_align8.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_dgemm_objs.dir/generated/gemm/80/dgemm/cutlass_simt_dgemm_128x128_8x3_tt_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_th_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_e5m2a_e5m2out.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_objs.dir/generated/gemm/80/f16_s16816gemm_f16/cutlass_tensorop_f16_s16816gemm_f16_256x128_32x3_nt_align8.cu.o [ 12%] Built target cutlass_library_gemm_sm80_dgemm_objs [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_s8_objs.dir/generated/gemm/80/f16_s16816gemm_f16_s8/all_sm80_f16_s16816gemm_f16_s8_gemm_operations.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_s8_objs.dir/generated/gemm/80/f16_s16816gemm_f16_s8/cutlass_tensorop_f16_s16816gemm_f16_s8_128x128_64x4_tn_align16.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_objs.dir/generated/gemm/80/f16_s16816gemm_f16/cutlass_tensorop_f16_s16816gemm_f16_256x128_32x3_tn_align8.cu.o [ 12%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_hh_align1.cu.o [ 12%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_f16_s8_objs [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_u8_objs.dir/generated/gemm/80/f16_s16816gemm_f16_u8/all_sm80_f16_s16816gemm_f16_u8_gemm_operations.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_objs.dir/generated/gemm/80/f16_s16816gemm_f16/cutlass_tensorop_f16_s16816gemm_f16_256x128_32x3_tt_align8.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_u8_objs.dir/generated/gemm/80/f16_s16816gemm_f16_u8/cutlass_tensorop_f16_s16816gemm_f16_u8_128x128_64x4_tn_align16.cu.o [ 12%] Built target cutlass_library_gemm_sm80_cgemm_objs [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/all_sm80_f16_s16816gemm_planar_complex_array_f16_gemm_operations.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_nn_align8.cu.o [ 12%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_f16_objs [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/all_sm80_f16_s16816gemm_planar_complex_f16_gemm_operations.cu.o [ 12%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_f16_u8_objs [ 12%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_s8_f16_objs.dir/generated/gemm/80/f16_s16816gemm_s8_f16/all_sm80_f16_s16816gemm_s8_f16_gemm_operations.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_nn_align8.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_s8_f16_objs.dir/generated/gemm/80/f16_s16816gemm_s8_f16/cutlass_tensorop_f16_s16816gemm_s8_f16_128x128_64x4_tn_align16.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_cn_align8.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_cn_align8.cu.o [ 12%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_s8_f16_objs [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_u8_f16_objs.dir/generated/gemm/80/f16_s16816gemm_u8_f16/all_sm80_f16_s16816gemm_u8_f16_gemm_operations.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_nc_align8.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_u8_f16_objs.dir/generated/gemm/80/f16_s16816gemm_u8_f16/cutlass_tensorop_f16_s16816gemm_u8_f16_128x128_64x4_tn_align16.cu.o [ 12%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_nc_align8.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_cc_align8.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_cc_align8.cu.o [ 12%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_u8_f16_objs [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16832spgemm_f16_objs.dir/generated/gemm/80/f16_s16832spgemm_f16/all_sm80_f16_s16832spgemm_f16_gemm_operations.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16832spgemm_f16_objs.dir/generated/gemm/80/f16_s16832spgemm_f16/cutlass_tensorop_f16_s16832spgemm_f16_64x128_64x6_nn_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_nt_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_nt_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16832spgemm_f16_objs.dir/generated/gemm/80/f16_s16832spgemm_f16/cutlass_tensorop_f16_s16832spgemm_f16_64x128_64x6_nt_align8.cu.o [ 13%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_ct_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_ct_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16832spgemm_f16_objs.dir/generated/gemm/80/f16_s16832spgemm_f16/cutlass_tensorop_f16_s16832spgemm_f16_64x128_64x6_tn_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_nh_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_nh_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16832spgemm_f16_objs.dir/generated/gemm/80/f16_s16832spgemm_f16/cutlass_tensorop_f16_s16832spgemm_f16_64x128_64x6_tt_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_ch_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_ch_align8.cu.o [ 13%] Built target 
cutlass_library_gemm_sm80_f16_s16832spgemm_f16_objs [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/all_sm80_gz884gemm_gemm_operations.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_tn_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_tn_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_nn_align1.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_hn_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_hn_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_cn_align1.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_tc_align8.cu.o [ 14%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_nc_align1.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_tc_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_cc_align1.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_hc_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_hc_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_nt_align1.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_tt_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_tt_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_ct_align1.cu.o [ 14%] Building 
CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_ht_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_ht_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_nh_align1.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_th_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_th_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_ch_align1.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_hh_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_hh_align8.cu.o [ 14%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_tn_align1.cu.o [ 14%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_objs.dir/generated/gemm/80/h16816gemm/all_sm80_h16816gemm_gemm_operations.cu.o [ 14%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_grouped_objs.dir/generated/gemm/80/h16816gemm_grouped/all_sm80_h16816gemm_grouped_gemm_operations.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_objs.dir/generated/gemm/80/h16816gemm/cutlass_tensorop_h16816gemm_256x128_32x3_nn_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_hn_align1.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_grouped_objs.dir/generated/gemm/80/h16816gemm_grouped/cutlass_tensorop_h16816gemm_grouped_256x128_32x3_nn_align8_scheduleDevice.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_tc_align1.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_objs.dir/generated/gemm/80/h16816gemm/cutlass_tensorop_h16816gemm_256x128_32x3_nt_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_grouped_objs.dir/generated/gemm/80/h16816gemm_grouped/cutlass_tensorop_h16816gemm_grouped_256x128_32x3_nt_align8_scheduleDevice.cu.o [ 14%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_hc_align1.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_objs.dir/generated/gemm/80/h16816gemm/cutlass_tensorop_h16816gemm_256x128_32x3_tn_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_grouped_objs.dir/generated/gemm/80/h16816gemm_grouped/cutlass_tensorop_h16816gemm_grouped_256x128_32x3_tn_align8_scheduleDevice.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_tt_align1.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_objs.dir/generated/gemm/80/h16816gemm/cutlass_tensorop_h16816gemm_256x128_32x3_tt_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_grouped_objs.dir/generated/gemm/80/h16816gemm_grouped/cutlass_tensorop_h16816gemm_grouped_256x128_32x3_tt_align8_scheduleDevice.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_ht_align1.cu.o [ 15%] Built target cutlass_library_gemm_sm80_h16816gemm_objs [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_th_align1.cu.o [ 15%] Built target cutlass_library_gemm_sm80_h16816gemm_grouped_objs [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/all_sm80_h16816gemm_planar_complex_gemm_operations.cu.o [ 15%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_hh_align1.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_nn_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_cn_align8.cu.o [ 15%] Built target cutlass_library_gemm_sm80_gz884gemm_objs [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/all_sm80_h16816gemm_planar_complex_array_gemm_operations.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_nc_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_cc_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_nn_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_nt_align8.cu.o [ 15%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_ct_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_cn_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_nh_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_ch_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_nc_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_tn_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_hn_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_cc_align8.cu.o [ 15%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_tc_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_hc_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_nt_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_tt_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_ht_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_ct_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_th_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_hh_align8.cu.o [ 16%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_nh_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_ch_align8.cu.o [ 16%] Built target cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_s8_f16_objs.dir/generated/gemm/80/h16816gemm_s8_f16/all_sm80_h16816gemm_s8_f16_gemm_operations.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_tn_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_s8_f16_objs.dir/generated/gemm/80/h16816gemm_s8_f16/cutlass_tensorop_h16816gemm_s8_f16_128x128_64x4_tn_align16.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp8in_fp16out.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_hn_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_tc_align8.cu.o [ 16%] Built target cutlass_library_gemm_sm80_h16816gemm_s8_f16_objs [ 16%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16832spgemm_objs.dir/generated/gemm/80/h16832spgemm/all_sm80_h16832spgemm_gemm_operations.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_hc_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16832spgemm_objs.dir/generated/gemm/80/h16832spgemm/cutlass_tensorop_h16832spgemm_64x128_64x6_nn_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_tt_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_ht_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16832spgemm_objs.dir/generated/gemm/80/h16832spgemm/cutlass_tensorop_h16832spgemm_64x128_64x6_nt_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_th_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_hh_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16832spgemm_objs.dir/generated/gemm/80/h16832spgemm/cutlass_tensorop_h16832spgemm_64x128_64x6_tn_align8.cu.o [ 16%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16832spgemm_objs.dir/generated/gemm/80/h16832spgemm/cutlass_tensorop_h16832spgemm_64x128_64x6_tt_align8.cu.o [ 16%] Built target cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i168128spgemm_s4_objs.dir/generated/gemm/80/i168128spgemm_s4/all_sm80_i168128spgemm_s4_gemm_operations.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i168128spgemm_s4_objs.dir/generated/gemm/80/i168128spgemm_s4/cutlass_tensorop_i168128spgemm_s4_64x64_256x4_tn_align32.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i168256andgemm_b1_objs.dir/generated/gemm/80/i168256andgemm_b1/all_sm80_i168256andgemm_b1_gemm_operations.cu.o [ 16%] Built target cutlass_library_gemm_sm80_h16832spgemm_objs [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i168256xorgemm_b1_objs.dir/generated/gemm/80/i168256xorgemm_b1/all_sm80_i168256xorgemm_b1_gemm_operations.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i168256andgemm_b1_objs.dir/generated/gemm/80/i168256andgemm_b1/cutlass_tensorop_i168256andgemm_b1_256x128_512x3_tn_align128.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i168256xorgemm_b1_objs.dir/generated/gemm/80/i168256xorgemm_b1/cutlass_tensorop_i168256xorgemm_b1_256x128_512x3_tn_align128.cu.o [ 16%] Built target cutlass_library_gemm_sm80_i168128spgemm_s4_objs [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16832gemm_s8_objs.dir/generated/gemm/80/i16832gemm_s8/all_sm80_i16832gemm_s8_gemm_operations.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16832gemm_s8_objs.dir/generated/gemm/80/i16832gemm_s8/cutlass_tensorop_i16832gemm_s8_256x128_64x3_tn_align16.cu.o [ 16%] Built target 
cutlass_library_gemm_sm80_i168256andgemm_b1_objs [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16832gemm_u8_objs.dir/generated/gemm/80/i16832gemm_u8/all_sm80_i16832gemm_u8_gemm_operations.cu.o [ 16%] Built target cutlass_library_gemm_sm80_i168256xorgemm_b1_objs [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16864gemm_s4_objs.dir/generated/gemm/80/i16864gemm_s4/all_sm80_i16864gemm_s4_gemm_operations.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16832gemm_u8_objs.dir/generated/gemm/80/i16832gemm_u8/cutlass_tensorop_i16832gemm_u8_256x128_64x3_tn_align16.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16864gemm_s4_objs.dir/generated/gemm/80/i16864gemm_s4/cutlass_tensorop_i16864gemm_s4_256x128_128x3_tn_align32.cu.o [ 16%] Built target cutlass_library_gemm_sm80_i16832gemm_s8_objs [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16864gemm_u4_objs.dir/generated/gemm/80/i16864gemm_u4/all_sm80_i16864gemm_u4_gemm_operations.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16864gemm_u4_objs.dir/generated/gemm/80/i16864gemm_u4/cutlass_tensorop_i16864gemm_u4_256x128_128x3_tn_align32.cu.o [ 16%] Built target cutlass_library_gemm_sm80_i16832gemm_u8_objs [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16864spgemm_s8_objs.dir/generated/gemm/80/i16864spgemm_s8/all_sm80_i16864spgemm_s8_gemm_operations.cu.o [ 16%] Built target cutlass_library_gemm_sm80_i16864gemm_s4_objs [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16864spgemm_s8_objs.dir/generated/gemm/80/i16864spgemm_s8/cutlass_tensorop_i16864spgemm_s8_128x64_128x3_tn_align16.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_objs.dir/generated/gemm/80/s16816gemm_bf16/all_sm80_s16816gemm_bf16_gemm_operations.cu.o 
[ 16%] Built target cutlass_library_gemm_sm80_i16864gemm_u4_objs [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_s8_objs.dir/generated/gemm/80/s16816gemm_bf16_s8/all_sm80_s16816gemm_bf16_s8_gemm_operations.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_objs.dir/generated/gemm/80/s16816gemm_bf16/cutlass_tensorop_s16816gemm_bf16_256x128_32x3_nn_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_s8_objs.dir/generated/gemm/80/s16816gemm_bf16_s8/cutlass_tensorop_s16816gemm_bf16_s8_128x128_64x4_tn_align16.cu.o [ 16%] Built target cutlass_library_gemm_sm80_i16864spgemm_s8_objs [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_u8_objs.dir/generated/gemm/80/s16816gemm_bf16_u8/all_sm80_s16816gemm_bf16_u8_gemm_operations.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_u8_objs.dir/generated/gemm/80/s16816gemm_bf16_u8/cutlass_tensorop_s16816gemm_bf16_u8_128x128_64x4_tn_align16.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_objs.dir/generated/gemm/80/s16816gemm_bf16/cutlass_tensorop_s16816gemm_bf16_256x128_32x3_nt_align8.cu.o [ 16%] Built target cutlass_library_gemm_sm80_s16816gemm_bf16_s8_objs [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_objs.dir/generated/gemm/80/s16816gemm_f16/all_sm80_s16816gemm_f16_gemm_operations.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_objs.dir/generated/gemm/80/s16816gemm_f16/cutlass_tensorop_s16816gemm_f16_256x128_32x3_nn_align8.cu.o [ 16%] Built target cutlass_library_gemm_sm80_s16816gemm_bf16_u8_objs [ 16%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_s8_objs.dir/generated/gemm/80/s16816gemm_f16_s8/all_sm80_s16816gemm_f16_s8_gemm_operations.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_objs.dir/generated/gemm/80/s16816gemm_bf16/cutlass_tensorop_s16816gemm_bf16_256x128_32x3_tn_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_s8_objs.dir/generated/gemm/80/s16816gemm_f16_s8/cutlass_tensorop_s16816gemm_f16_s8_128x128_64x4_tn_align16.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_objs.dir/generated/gemm/80/s16816gemm_f16/cutlass_tensorop_s16816gemm_f16_256x128_32x3_nt_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_objs.dir/generated/gemm/80/s16816gemm_bf16/cutlass_tensorop_s16816gemm_bf16_256x128_32x3_tt_align8.cu.o [ 16%] Built target cutlass_library_gemm_sm80_s16816gemm_f16_s8_objs [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_u8_objs.dir/generated/gemm/80/s16816gemm_f16_u8/all_sm80_s16816gemm_f16_u8_gemm_operations.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_u8_objs.dir/generated/gemm/80/s16816gemm_f16_u8/cutlass_tensorop_s16816gemm_f16_u8_128x128_64x4_tn_align16.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_objs.dir/generated/gemm/80/s16816gemm_f16/cutlass_tensorop_s16816gemm_f16_256x128_32x3_tn_align8.cu.o [ 16%] Built target cutlass_library_gemm_sm80_s16816gemm_bf16_objs [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_bf16_objs.dir/generated/gemm/80/s16816gemm_grouped_bf16/all_sm80_s16816gemm_grouped_bf16_gemm_operations.cu.o [ 16%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_bf16_objs.dir/generated/gemm/80/s16816gemm_grouped_bf16/cutlass_tensorop_s16816gemm_grouped_bf16_256x128_32x3_nn_align8_scheduleDevice.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_objs.dir/generated/gemm/80/s16816gemm_f16/cutlass_tensorop_s16816gemm_f16_256x128_32x3_tt_align8.cu.o [ 16%] Built target cutlass_library_gemm_sm80_s16816gemm_f16_u8_objs [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_f16_objs.dir/generated/gemm/80/s16816gemm_grouped_f16/all_sm80_s16816gemm_grouped_f16_gemm_operations.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_f16_objs.dir/generated/gemm/80/s16816gemm_grouped_f16/cutlass_tensorop_s16816gemm_grouped_f16_256x128_32x3_nn_align8_scheduleDevice.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_bf16_objs.dir/generated/gemm/80/s16816gemm_grouped_bf16/cutlass_tensorop_s16816gemm_grouped_bf16_256x128_32x3_nt_align8_scheduleDevice.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp8in_bf16out.cu.o [ 16%] Built target cutlass_library_gemm_sm80_s16816gemm_f16_objs [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/all_sm80_s16816gemm_planar_complex_array_bf16_gemm_operations.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_nn_align8.cu.o [ 16%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_f16_objs.dir/generated/gemm/80/s16816gemm_grouped_f16/cutlass_tensorop_s16816gemm_grouped_f16_256x128_32x3_nt_align8_scheduleDevice.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_bf16_objs.dir/generated/gemm/80/s16816gemm_grouped_bf16/cutlass_tensorop_s16816gemm_grouped_bf16_256x128_32x3_tn_align8_scheduleDevice.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_cn_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_f16_objs.dir/generated/gemm/80/s16816gemm_grouped_f16/cutlass_tensorop_s16816gemm_grouped_f16_256x128_32x3_tn_align8_scheduleDevice.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_bf16_objs.dir/generated/gemm/80/s16816gemm_grouped_bf16/cutlass_tensorop_s16816gemm_grouped_bf16_256x128_32x3_tt_align8_scheduleDevice.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_nc_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_f16_objs.dir/generated/gemm/80/s16816gemm_grouped_f16/cutlass_tensorop_s16816gemm_grouped_f16_256x128_32x3_tt_align8_scheduleDevice.cu.o [ 16%] Built target cutlass_library_gemm_sm80_s16816gemm_grouped_bf16_objs [ 16%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_cc_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_nt_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_ct_align8.cu.o [ 16%] Built target cutlass_library_gemm_sm80_s16816gemm_grouped_f16_objs [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/all_sm80_s16816gemm_planar_complex_array_f16_gemm_operations.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_nn_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_nh_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_ch_align8.cu.o [ 16%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_cn_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_nc_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_tn_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_cc_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_nt_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_hn_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_ct_align8.cu.o [ 16%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_nh_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_tc_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_ch_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_tn_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_hc_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_hn_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_tc_align8.cu.o [ 16%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_tt_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_hc_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_tt_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_ht_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_ht_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_th_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_th_align8.cu.o [ 17%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_hh_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_hh_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/all_sm80_s16816gemm_planar_complex_bf16_gemm_operations.cu.o [ 17%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/all_sm80_s16816gemm_planar_complex_f16_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_nn_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_nn_align8.cu.o [ 17%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_s8_bf16_objs.dir/generated/gemm/80/s16816gemm_s8_bf16/all_sm80_s16816gemm_s8_bf16_gemm_operations.cu.o [ 17%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_s8_bf16_objs.dir/generated/gemm/80/s16816gemm_s8_bf16/cutlass_tensorop_s16816gemm_s8_bf16_128x128_64x4_tn_align16.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp8in_fp32out.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_cn_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_cn_align8.cu.o [ 17%] Built target cutlass_library_gemm_sm80_s16816gemm_s8_bf16_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_s8_f16_objs.dir/generated/gemm/80/s16816gemm_s8_f16/all_sm80_s16816gemm_s8_f16_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_nc_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_s8_f16_objs.dir/generated/gemm/80/s16816gemm_s8_f16/cutlass_tensorop_s16816gemm_s8_f16_128x128_64x4_tn_align16.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_nc_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_cc_align8.cu.o [ 17%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_cc_align8.cu.o [ 17%] Built target cutlass_library_gemm_sm80_s16816gemm_s8_f16_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_u8_bf16_objs.dir/generated/gemm/80/s16816gemm_u8_bf16/all_sm80_s16816gemm_u8_bf16_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_u8_bf16_objs.dir/generated/gemm/80/s16816gemm_u8_bf16/cutlass_tensorop_s16816gemm_u8_bf16_128x128_64x4_tn_align16.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_nt_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_nt_align8.cu.o [ 17%] Built target cutlass_library_gemm_sm80_s16816gemm_u8_bf16_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_cf32_cfprop_optimized_cf32_objs.dir/generated/conv2d/75/cf32_cfprop_optimized_cf32/all_sm75_cf32_cfprop_optimized_cf32_conv2d_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_ct_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_ct_align8.cu.o [ 17%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_cf32_cfprop_optimized_cf32_objs.dir/generated/conv2d/75/cf32_cfprop_optimized_cf32/cutlass_simt_cf32_cfprop_optimized_cf32_128x128_8x5_nhwc_align1.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_nh_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_nh_align8.cu.o [ 17%] Built target cutlass_library_conv2d_sm75_cf32_cfprop_optimized_cf32_objs [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_u8_f16_objs.dir/generated/gemm/80/s16816gemm_u8_f16/all_sm80_s16816gemm_u8_f16_gemm_operations.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_u8_f16_objs.dir/generated/gemm/80/s16816gemm_u8_f16/cutlass_tensorop_s16816gemm_u8_f16_128x128_64x4_tn_align16.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_ch_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_ch_align8.cu.o [ 18%] Built target cutlass_library_gemm_sm80_s16816gemm_u8_f16_objs [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816tf32spgemm_objs.dir/generated/gemm/80/s16816tf32spgemm/all_sm80_s16816tf32spgemm_gemm_operations.cu.o [ 18%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_tn_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_tn_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816tf32spgemm_objs.dir/generated/gemm/80/s16816tf32spgemm/cutlass_tensorop_s16816tf32spgemm_128x64_32x3_nn_align4.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_hn_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_hn_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816tf32spgemm_objs.dir/generated/gemm/80/s16816tf32spgemm/cutlass_tensorop_s16816tf32spgemm_128x64_32x3_nt_align4.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_tc_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_tc_align8.cu.o [ 18%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816tf32spgemm_objs.dir/generated/gemm/80/s16816tf32spgemm/cutlass_tensorop_s16816tf32spgemm_128x64_32x3_tn_align4.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_hc_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_hc_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816tf32spgemm_objs.dir/generated/gemm/80/s16816tf32spgemm/cutlass_tensorop_s16816tf32spgemm_128x64_32x3_tt_align4.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_tt_align8.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_tt_align8.cu.o [ 19%] Built target cutlass_library_gemm_sm80_s16816tf32spgemm_objs [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_bf16_objs.dir/generated/gemm/80/s16832spgemm_bf16/all_sm80_s16832spgemm_bf16_gemm_operations.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_bf16_objs.dir/generated/gemm/80/s16832spgemm_bf16/cutlass_tensorop_s16832spgemm_bf16_64x128_64x6_nn_align8.cu.o [ 19%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_ht_align8.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_ht_align8.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_bf16_objs.dir/generated/gemm/80/s16832spgemm_bf16/cutlass_tensorop_s16832spgemm_bf16_64x128_64x6_nt_align8.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_th_align8.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_th_align8.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp32out.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_bf16_objs.dir/generated/gemm/80/s16832spgemm_bf16/cutlass_tensorop_s16832spgemm_bf16_64x128_64x6_tn_align8.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_hh_align8.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_hh_align8.cu.o [ 19%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_bf16_objs.dir/generated/gemm/80/s16832spgemm_bf16/cutlass_tensorop_s16832spgemm_bf16_64x128_64x6_tt_align8.cu.o [ 19%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_f16_objs.dir/generated/gemm/80/s16832spgemm_f16/all_sm80_s16832spgemm_f16_gemm_operations.cu.o [ 19%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688bf16gemm_objs.dir/generated/gemm/80/s1688bf16gemm/all_sm80_s1688bf16gemm_gemm_operations.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_f16_objs.dir/generated/gemm/80/s16832spgemm_f16/cutlass_tensorop_s16832spgemm_f16_64x128_64x6_nn_align8.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688bf16gemm_objs.dir/generated/gemm/80/s1688bf16gemm/cutlass_tensorop_s1688bf16gemm_256x128_16x3_nn_align4.cu.o [ 19%] Built target cutlass_library_gemm_sm80_s16832spgemm_bf16_objs [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688f16gemm_objs.dir/generated/gemm/80/s1688f16gemm/all_sm80_s1688f16gemm_gemm_operations.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_f16_objs.dir/generated/gemm/80/s16832spgemm_f16/cutlass_tensorop_s16832spgemm_f16_64x128_64x6_nt_align8.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688f16gemm_objs.dir/generated/gemm/80/s1688f16gemm/cutlass_tensorop_s1688f16gemm_256x128_16x3_nn_align4.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688bf16gemm_objs.dir/generated/gemm/80/s1688bf16gemm/cutlass_tensorop_s1688bf16gemm_256x128_16x3_nt_align4.cu.o [ 19%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688f16gemm_objs.dir/generated/gemm/80/s1688f16gemm/cutlass_tensorop_s1688f16gemm_256x128_16x3_nt_align4.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688bf16gemm_objs.dir/generated/gemm/80/s1688bf16gemm/cutlass_tensorop_s1688bf16gemm_256x128_16x3_tn_align4.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_f16_objs.dir/generated/gemm/80/s16832spgemm_f16/cutlass_tensorop_s16832spgemm_f16_64x128_64x6_tn_align8.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688f16gemm_objs.dir/generated/gemm/80/s1688f16gemm/cutlass_tensorop_s1688f16gemm_256x128_16x3_tn_align4.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688bf16gemm_objs.dir/generated/gemm/80/s1688bf16gemm/cutlass_tensorop_s1688bf16gemm_256x128_16x3_tt_align4.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_f16_objs.dir/generated/gemm/80/s16832spgemm_f16/cutlass_tensorop_s16832spgemm_f16_64x128_64x6_tt_align8.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688f16gemm_objs.dir/generated/gemm/80/s1688f16gemm/cutlass_tensorop_s1688f16gemm_256x128_16x3_tt_align4.cu.o [ 20%] Built target cutlass_library_gemm_sm80_s1688bf16gemm_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_objs.dir/generated/gemm/80/s1688gemm/all_sm80_s1688gemm_gemm_operations.cu.o [ 20%] Built target cutlass_library_gemm_sm80_s16832spgemm_f16_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_tf32_objs.dir/generated/gemm/80/s1688gemm_tf32/all_sm80_s1688gemm_tf32_gemm_operations.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_objs.dir/generated/gemm/80/s1688gemm/cutlass_tensorop_s1688gemm_128x128_16x4_nn_align4.cu.o [ 20%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_tf32_objs.dir/generated/gemm/80/s1688gemm_tf32/cutlass_tensorop_s1688gemm_tf32_256x128_16x3_nn_align4.cu.o [ 20%] Built target cutlass_library_gemm_sm80_s1688f16gemm_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688tf32gemm_objs.dir/generated/gemm/80/s1688tf32gemm/all_sm80_s1688tf32gemm_gemm_operations.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_objs.dir/generated/gemm/80/s1688gemm/cutlass_tensorop_s1688gemm_128x128_16x4_nt_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688tf32gemm_objs.dir/generated/gemm/80/s1688tf32gemm/cutlass_tensorop_s1688tf32gemm_256x128_16x3_nn_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_tf32_objs.dir/generated/gemm/80/s1688gemm_tf32/cutlass_tensorop_s1688gemm_tf32_256x128_16x3_nt_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_objs.dir/generated/gemm/80/s1688gemm/cutlass_tensorop_s1688gemm_128x128_16x4_tn_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688tf32gemm_objs.dir/generated/gemm/80/s1688tf32gemm/cutlass_tensorop_s1688tf32gemm_256x128_16x3_nt_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_tf32_objs.dir/generated/gemm/80/s1688gemm_tf32/cutlass_tensorop_s1688gemm_tf32_256x128_16x3_tn_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_objs.dir/generated/gemm/80/s1688gemm/cutlass_tensorop_s1688gemm_128x128_16x4_tt_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688tf32gemm_objs.dir/generated/gemm/80/s1688tf32gemm/cutlass_tensorop_s1688tf32gemm_256x128_16x3_tn_align4.cu.o [ 20%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_tf32_objs.dir/generated/gemm/80/s1688gemm_tf32/cutlass_tensorop_s1688gemm_tf32_256x128_16x3_tt_align4.cu.o [ 20%] Built target cutlass_library_gemm_sm80_s1688gemm_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s4_i168128spgemm_s4_objs.dir/generated/gemm/80/s4_i168128spgemm_s4/all_sm80_s4_i168128spgemm_s4_gemm_operations.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688tf32gemm_objs.dir/generated/gemm/80/s1688tf32gemm/cutlass_tensorop_s1688tf32gemm_256x128_16x3_tt_align4.cu.o [ 20%] Built target cutlass_library_gemm_sm80_s1688gemm_tf32_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s4_i16864gemm_s4_objs.dir/generated/gemm/80/s4_i16864gemm_s4/all_sm80_s4_i16864gemm_s4_gemm_operations.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s4_i168128spgemm_s4_objs.dir/generated/gemm/80/s4_i168128spgemm_s4/cutlass_tensorop_s4_i168128spgemm_s4_64x64_256x4_tn_align32.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s4_i16864gemm_s4_objs.dir/generated/gemm/80/s4_i16864gemm_s4/cutlass_tensorop_s4_i16864gemm_s4_256x128_128x3_tn_align32.cu.o [ 20%] Built target cutlass_library_gemm_sm80_s1688tf32gemm_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s8_i16832gemm_s8_objs.dir/generated/gemm/80/s8_i16832gemm_s8/all_sm80_s8_i16832gemm_s8_gemm_operations.cu.o [ 20%] Built target cutlass_library_gemm_sm80_s4_i168128spgemm_s4_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s8_i16864spgemm_s8_objs.dir/generated/gemm/80/s8_i16864spgemm_s8/all_sm80_s8_i16864spgemm_s8_gemm_operations.cu.o [ 20%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s8_i16832gemm_s8_objs.dir/generated/gemm/80/s8_i16832gemm_s8/cutlass_tensorop_s8_i16832gemm_s8_256x128_64x3_tn_align16.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s8_i16864spgemm_s8_objs.dir/generated/gemm/80/s8_i16864spgemm_s8/cutlass_tensorop_s8_i16864spgemm_s8_128x64_128x3_tn_align16.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s4_i16864gemm_s4_objs.dir/generated/gemm/80/s4_i16864gemm_s4/cutlass_tensorop_s4_i16864gemm_s4_256x128_128x3_n64t64_align32.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp_other.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s8_i16832gemm_s8_objs.dir/generated/gemm/80/s8_i16832gemm_s8/cutlass_tensorop_s8_i16832gemm_s8_256x128_64x3_n32t32_align16.cu.o [ 20%] Built target cutlass_library_gemm_sm80_s4_i16864gemm_s4_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/all_sm80_sgemm_gemm_operations.cu.o [ 20%] Built target cutlass_library_gemm_sm80_s8_i16864spgemm_s8_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_tf32_s1688gemm_tf32_objs.dir/generated/gemm/80/tf32_s1688gemm_tf32/all_sm80_tf32_s1688gemm_tf32_gemm_operations.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_256x128_8x5_nn_align1.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_tf32_s1688gemm_tf32_objs.dir/generated/gemm/80/tf32_s1688gemm_tf32/cutlass_tensorop_tf32_s1688gemm_tf32_256x128_16x3_nn_align4.cu.o [ 20%] Built target cutlass_library_gemm_sm80_s8_i16832gemm_s8_objs [ 20%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_u4_i16864gemm_u4_objs.dir/generated/gemm/80/u4_i16864gemm_u4/all_sm80_u4_i16864gemm_u4_gemm_operations.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_u4_i16864gemm_u4_objs.dir/generated/gemm/80/u4_i16864gemm_u4/cutlass_tensorop_u4_i16864gemm_u4_256x128_128x3_tn_align32.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_tf32_s1688gemm_tf32_objs.dir/generated/gemm/80/tf32_s1688gemm_tf32/cutlass_tensorop_tf32_s1688gemm_tf32_256x128_16x3_nt_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_256x128_8x5_nt_align1.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_u4_i16864gemm_u4_objs.dir/generated/gemm/80/u4_i16864gemm_u4/cutlass_tensorop_u4_i16864gemm_u4_256x128_128x3_n64t64_align32.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_tf32_s1688gemm_tf32_objs.dir/generated/gemm/80/tf32_s1688gemm_tf32/cutlass_tensorop_tf32_s1688gemm_tf32_256x128_16x3_tn_align4.cu.o [ 20%] Built target cutlass_library_gemm_sm80_u4_i16864gemm_u4_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_u8_i16832gemm_u8_objs.dir/generated/gemm/80/u8_i16832gemm_u8/all_sm80_u8_i16832gemm_u8_gemm_operations.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_256x128_8x5_tn_align1.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_u8_i16832gemm_u8_objs.dir/generated/gemm/80/u8_i16832gemm_u8/cutlass_tensorop_u8_i16832gemm_u8_256x128_64x3_tn_align16.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_tf32_s1688gemm_tf32_objs.dir/generated/gemm/80/tf32_s1688gemm_tf32/cutlass_tensorop_tf32_s1688gemm_tf32_256x128_16x3_tt_align4.cu.o [ 20%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_u8_i16832gemm_u8_objs.dir/generated/gemm/80/u8_i16832gemm_u8/cutlass_tensorop_u8_i16832gemm_u8_256x128_64x3_n32t32_align16.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_256x128_8x5_tt_align1.cu.o [ 20%] Built target cutlass_library_gemm_sm80_tf32_s1688gemm_tf32_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/all_sm80_z884gemm_gemm_operations.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_nn_align1.cu.o [ 20%] Built target cutlass_library_gemm_sm80_u8_i16832gemm_u8_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_objs.dir/generated/gemm/89/s16832fastaccumgemm_e4m3/all_sm89_s16832fastaccumgemm_e4m3_gemm_operations.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_objs.dir/generated/gemm/89/s16832fastaccumgemm_e4m3/cutlass_tensorop_s16832fastaccumgemm_e4m3_256x128_64x3_tn_align16.cu.o [ 20%] Built target cutlass_library_gemm_sm80_sgemm_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2_objs.dir/generated/gemm/89/s16832fastaccumgemm_e4m3_e5m2/all_sm89_s16832fastaccumgemm_e4m3_e5m2_gemm_operations.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_cn_align1.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2_objs.dir/generated/gemm/89/s16832fastaccumgemm_e4m3_e5m2/cutlass_tensorop_s16832fastaccumgemm_e4m3_e5m2_256x128_64x3_tn_align16.cu.o [ 20%] Built 
target cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_objs.dir/generated/gemm/89/s16832fastaccumgemm_e5m2/all_sm89_s16832fastaccumgemm_e5m2_gemm_operations.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_nc_align1.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_objs.dir/generated/gemm/89/s16832fastaccumgemm_e5m2/cutlass_tensorop_s16832fastaccumgemm_e5m2_256x128_64x3_tn_align16.cu.o [ 20%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3_objs.dir/generated/gemm/89/s16832fastaccumgemm_e5m2_e4m3/all_sm89_s16832fastaccumgemm_e5m2_e4m3_gemm_operations.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3_objs.dir/generated/gemm/89/s16832fastaccumgemm_e5m2_e4m3/cutlass_tensorop_s16832fastaccumgemm_e5m2_e4m3_256x128_64x3_tn_align16.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_cc_align1.cu.o [ 20%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832gemm_e4m3_objs.dir/generated/gemm/89/s16832gemm_e4m3/all_sm89_s16832gemm_e4m3_gemm_operations.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832gemm_e4m3_objs.dir/generated/gemm/89/s16832gemm_e4m3/cutlass_tensorop_s16832gemm_e4m3_256x128_64x3_tn_align16.cu.o [ 20%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3_objs [ 20%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832gemm_e4m3_e5m2_objs.dir/generated/gemm/89/s16832gemm_e4m3_e5m2/all_sm89_s16832gemm_e4m3_e5m2_gemm_operations.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_nt_align1.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832gemm_e4m3_e5m2_objs.dir/generated/gemm/89/s16832gemm_e4m3_e5m2/cutlass_tensorop_s16832gemm_e4m3_e5m2_256x128_64x3_tn_align16.cu.o [ 20%] Built target cutlass_library_gemm_sm89_s16832gemm_e4m3_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832gemm_e5m2_objs.dir/generated/gemm/89/s16832gemm_e5m2/all_sm89_s16832gemm_e5m2_gemm_operations.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832gemm_e5m2_objs.dir/generated/gemm/89/s16832gemm_e5m2/cutlass_tensorop_s16832gemm_e5m2_256x128_64x3_tn_align16.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_ct_align1.cu.o [ 21%] Built target cutlass_library_gemm_sm89_s16832gemm_e4m3_e5m2_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832gemm_e5m2_e4m3_objs.dir/generated/gemm/89/s16832gemm_e5m2_e4m3/all_sm89_s16832gemm_e5m2_e4m3_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832gemm_e5m2_e4m3_objs.dir/generated/gemm/89/s16832gemm_e5m2_e4m3/cutlass_tensorop_s16832gemm_e5m2_e4m3_256x128_64x3_tn_align16.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp_mixed_input.cu.o [ 21%] Built target cutlass_library_gemm_sm89_s16832gemm_e5m2_objs [ 21%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_objs.dir/generated/gemm/89/s16864fastaccumspgemm_e4m3/all_sm89_s16864fastaccumspgemm_e4m3_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_nh_align1.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_objs.dir/generated/gemm/89/s16864fastaccumspgemm_e4m3/cutlass_tensorop_s16864fastaccumspgemm_e4m3_128x64_128x3_tn_align16.cu.o [ 21%] Built target cutlass_library_gemm_sm89_s16832gemm_e5m2_e4m3_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2_objs.dir/generated/gemm/89/s16864fastaccumspgemm_e4m3_e5m2/all_sm89_s16864fastaccumspgemm_e4m3_e5m2_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2_objs.dir/generated/gemm/89/s16864fastaccumspgemm_e4m3_e5m2/cutlass_tensorop_s16864fastaccumspgemm_e4m3_e5m2_128x64_128x3_tn_align16.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_ch_align1.cu.o [ 21%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_objs.dir/generated/gemm/89/s16864fastaccumspgemm_e5m2/all_sm89_s16864fastaccumspgemm_e5m2_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_objs.dir/generated/gemm/89/s16864fastaccumspgemm_e5m2/cutlass_tensorop_s16864fastaccumspgemm_e5m2_128x64_128x3_tn_align16.cu.o [ 21%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2_objs [ 21%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3_objs.dir/generated/gemm/89/s16864fastaccumspgemm_e5m2_e4m3/all_sm89_s16864fastaccumspgemm_e5m2_e4m3_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_tn_align1.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3_objs.dir/generated/gemm/89/s16864fastaccumspgemm_e5m2_e4m3/cutlass_tensorop_s16864fastaccumspgemm_e5m2_e4m3_128x64_128x3_tn_align16.cu.o [ 21%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864spgemm_e4m3_objs.dir/generated/gemm/89/s16864spgemm_e4m3/all_sm89_s16864spgemm_e4m3_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_hn_align1.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864spgemm_e4m3_objs.dir/generated/gemm/89/s16864spgemm_e4m3/cutlass_tensorop_s16864spgemm_e4m3_128x64_128x3_tn_align16.cu.o [ 21%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864spgemm_e4m3_e5m2_objs.dir/generated/gemm/89/s16864spgemm_e4m3_e5m2/all_sm89_s16864spgemm_e4m3_e5m2_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864spgemm_e4m3_e5m2_objs.dir/generated/gemm/89/s16864spgemm_e4m3_e5m2/cutlass_tensorop_s16864spgemm_e4m3_e5m2_128x64_128x3_tn_align16.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_tc_align1.cu.o [ 21%] Built target 
cutlass_library_gemm_sm89_s16864spgemm_e4m3_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864spgemm_e5m2_objs.dir/generated/gemm/89/s16864spgemm_e5m2/all_sm89_s16864spgemm_e5m2_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864spgemm_e5m2_objs.dir/generated/gemm/89/s16864spgemm_e5m2/cutlass_tensorop_s16864spgemm_e5m2_128x64_128x3_tn_align16.cu.o [ 21%] Built target cutlass_library_gemm_sm89_s16864spgemm_e4m3_e5m2_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_hc_align1.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_tt_align1.cu.o [ 21%] Built target cutlass_library_gemm_sm89_s16864spgemm_e5m2_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864spgemm_e5m2_e4m3_objs.dir/generated/gemm/89/s16864spgemm_e5m2_e4m3/all_sm89_s16864spgemm_e5m2_e4m3_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_ht_align1.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864spgemm_e5m2_e4m3_objs.dir/generated/gemm/89/s16864spgemm_e5m2_e4m3/cutlass_tensorop_s16864spgemm_e5m2_e4m3_128x64_128x3_tn_align16.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_th_align1.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_hh_align1.cu.o [ 21%] Built target cutlass_library_gemm_sm89_s16864spgemm_e5m2_e4m3_objs [ 21%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/all_sm90_bf16_s64x128x16gemm_bf16_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 21%] Built target cutlass_library_gemm_sm80_z884gemm_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/all_sm90_bf16_s64x128x32gemm_e4m3_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 21%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 21%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8.cu.o [ 21%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ 
= cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 21%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = 
cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, 
cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, 
cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) 
[with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> 
> > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, 
cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, 
CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, 
cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, 
cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, 
cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 21%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 21%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/initialize_reference_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: 
required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reduction/reduction_device.cu.o [ 21%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reduction/init_reduction_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/conv2d.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = 
cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, 
cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, 
cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the 
implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = 
cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; 
SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 21%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > 
>, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 21%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 
21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/conv3d.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: 
required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, 
EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, 
void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename 
std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = 
cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, 
CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = 
struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, 
StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, 
cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, 
cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, 
cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 22%] Building CXX object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/initialize_all.cpp.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/all_gemm_operations.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/all_conv2d_operations.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: 
required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with 
GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, 
cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with 
Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv3d/all_conv3d_operations.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/rank_k/all_rank_k_operations.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: 
required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) 
[with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, 
cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, 
cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/rank_2k/all_rank_2k_operations.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/trmm/all_trmm_operations.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/symm/all_symm_operations.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 24%] Built target cutlass_library_objs [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/all_sm90_bf16_s64x128x32gemm_e4m3_e5m2_gemm_operations.cu.o [ 24%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 24%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const 
Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status 
cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 24%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = 
cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> 
>, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ttn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, 
ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, 
cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, 
ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, 
cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, 
cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ttn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, 
FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = 
cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename 
std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required 
from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 25%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, 
ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, 
cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, 
cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct 
cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 25%] Built target cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/all_sm90_bf16_s64x128x32gemm_e5m2_gemm_operations.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In 
instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, 
cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) 
const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 26%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque 
[16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 26%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; 
EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, 
cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long 
int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, 
cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the 
implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > 
>, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, 
cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, 
cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, 
SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct 
cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, 
CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, 
cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, 
cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const 
cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 26%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; 
TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 
'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, 
cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const 
void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 
'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 
'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 26%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; 
EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> 
>, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 
'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 26%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 27%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > 
>, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o 
[ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined 
constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]'
 cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS];
 ^
[ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o
[ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o
[ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o
[ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
[ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o
[ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o
[ 27%] Building CUDA object
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ 
= cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, 
StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, 
cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using 
TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 
16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; 
CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int 
StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, 
cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 27%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/all_sm90_bf16_s64x128x32gemm_e5m2_e4m3_gemm_operations.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, 
FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, 
void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename 
std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) 
const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, 
cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 27%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 27%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t 
CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 27%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> 
>, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = 
cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, 
cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, 
cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, 
cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined 
constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 28%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> 
>, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 28%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 
'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> 
>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In 
instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, 
void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ 
= cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 
'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, 
CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, 
cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, 
cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, 
CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, 
cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long 
int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, 
cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 28%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, 
CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, 
long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, 
cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 28%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, 
CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, 
cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' 
{aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, 
ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, 
cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 
'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building 
CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
[ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
[ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o
[ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o
[ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o
[ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
[ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
[ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o
[ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o
[ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
[ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o
[ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
[ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
[ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
[ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o
[ 29%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs
[ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_d1684gemm_objs.dir/generated/gemm/90/d1684gemm/all_sm90_d1684gemm_gemm_operations.cu.o
[ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
[ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o
[ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_d1684gemm_objs.dir/generated/gemm/90/d1684gemm/cutlass_sm90_tensorop_d1684gemm_f64_f64_f64_f64_f64_128x128x16_1x1x1_3_nnn_align1.cu.o
[ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o
[ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
[ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o
[ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_d1684gemm_objs.dir/generated/gemm/90/d1684gemm/cutlass_sm90_tensorop_d1684gemm_f64_f64_f64_f64_f64_128x128x16_1x1x1_3_ntn_align1.cu.o
[ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o
[ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
[ 29%] Building CUDA object
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_d1684gemm_objs.dir/generated/gemm/90/d1684gemm/cutlass_sm90_tensorop_d1684gemm_f64_f64_f64_f64_f64_128x128x16_1x1x1_3_tnn_align1.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, 
cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_d1684gemm_objs.dir/generated/gemm/90/d1684gemm/cutlass_sm90_tensorop_d1684gemm_f64_f64_f64_f64_f64_128x128x16_1x1x1_3_ttn_align1.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 29%] Built target cutlass_library_gemm_sm90_d1684gemm_objs [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/all_sm90_f16_s64x128x16gemm_f16_gemm_operations.cu.o [ 
29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 29%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, 
StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, 
cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, 
cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using 
TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, 
cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> 
>, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 
2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 30%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/all_sm90_f16_s64x128x32gemm_e4m3_gemm_operations.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 31%] Building 
CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, 
CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, 
cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, 
cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, 
cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, 
const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 31%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ 
= cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, 
StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, 
void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, 
cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const 
cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 31%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ 
= cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> 
> > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 31%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 31%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > 
>, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o
[ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
[ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
[ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o
[ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o
[ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o
[ 31%] Building CUDA object
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; 
ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, 
cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > 
>, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, 
cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const 
cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> 
> >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, 
cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 31%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; 
TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > 
>, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 31%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 31%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 31%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 31%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs [ 31%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/all_sm90_f16_s64x128x32gemm_e4m3_e5m2_gemm_operations.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; 
ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, 
long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, 
cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = 
cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > 
> >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 31%] Building 
CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> 
>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const 
cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, 
CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, 
long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long 
int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, 
cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, 
cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = 
cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, 
cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, 
long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, 
cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ 
= cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ 
= cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, 
long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom 
^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; 
SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, 
cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t 
CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; 
CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 
'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, 
void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, 
cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) 
const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, 
cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, 
cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, 
cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, 
cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, 
cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, 
cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, 
cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, 
cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 32%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; 
cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 32%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status 
cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; 
cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 32%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> 
>, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > 
>, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 33%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 33%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 33%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/all_sm90_f16_s64x128x32gemm_e5m2_gemm_operations.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, 
FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) 
[with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; 
typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, 
StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> 
>, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, 
cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static 
constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const 
Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status 
cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; 
CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with 
Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, 
cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const 
cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, 
cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 35%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = 
cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, 
cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, 
cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ttn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, 
long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, 
cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default 
constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: 
required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, 
cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status 
cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 35%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_nnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ntn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = 
cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, 
cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; 
FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' 
cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 35%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_tnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
[ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ttn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
[ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_nnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
[ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
[ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
[ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ntn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
[ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o
[ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o
[ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o
[ 35%] Built target cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs
[ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/all_sm90_f16_s64x128x32gemm_e5m2_e4m3_gemm_operations.cu.o
[ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o
[ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o
[ 35%] Building CUDA object
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 35%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> 
>, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 35%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, 
StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, 
cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, 
cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D 
= struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 35%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 35%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> 
>, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined 
constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; 
bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; 
CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; 
ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long 
int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> 
>, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 35%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, 
ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, 
cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, 
cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, 
cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, 
CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const 
cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, 
cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const 
void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, 
cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, 
cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> 
> > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = 
cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = 
void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) 
const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 36%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> 
> > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 36%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 36%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, 
cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, 
cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; 
FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the 
implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, 
cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, 
cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, 
cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, 
cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> 
>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, 
cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, 
void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, 
cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, 
cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t 
CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, 
ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long 
int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, 
cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 37%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/all_sm90_gz1684gemm_gemm_operations.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_nnn_align1.cu.o [ 37%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_cnn_align1.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_ncn_align1.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_ccn_align1.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_ntn_align1.cu.o [ 37%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_ctn_align1.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_nhn_align1.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_chn_align1.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 37%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_tnn_align1.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_hnn_align1.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_tcn_align1.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_hcn_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const 
cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_ttn_align1.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_htn_align1.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 
'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, 
void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status 
cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_thn_align1.cu.o [ 37%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_hhn_align1.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 37%] Built target cutlass_library_gemm_sm90_gz1684gemm_objs [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/all_sm90_h64x128x16gemm_gemm_operations.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, 
EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, 
cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, 
long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, 
cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 37%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 37%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs [ 37%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/all_sm90_i64x128x32gemm_s8_gemm_operations.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, 
EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const 
cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, 
cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const 
void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) 
const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = int; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, 
cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, 
cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 37%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, 
SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = int; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, 
cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, 
cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long 
int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, 
cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, 
CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = int; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, 
cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, 
cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, 
cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, 
cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, 
cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, 
const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long 
int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, 
cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const 
cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, 
FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, 
cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, 
cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 38%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, 
FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) 
[with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = 
cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8.cu.o [ 38%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Built target cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/all_sm90_i64x128x32gemm_u8_gemm_operations.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, 
ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = int; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, 
void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, 
void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_nosmem.cu.o [ 38%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = int; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > 
>, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/all_sm90_s64x128x16gemm_bf16_gemm_operations.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, 
SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = int; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 
38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, 
StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with 
ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 39%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; 
cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 39%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; 
cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 39%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 39%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 39%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; 
SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 39%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 39%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, 
cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 39%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 39%] Built target cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/all_sm90_s64x128x16gemm_f16_gemm_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long 
int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long 
int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; 
FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, 
cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, 
cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 39%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ 
= cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> 
>, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 40%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; 
ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor 
does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; 
CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 40%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > 
> >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 40%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 40%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 40%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 40%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 40%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 40%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t 
= CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; 
cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 40%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int 
FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, 
cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; 
TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ttn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, 
CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; 
CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; 
typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_nnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' 
cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = 
cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ntn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_tnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, 
long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, 
long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, 
cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no 
user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; 
CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t 
CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = 
cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ttn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> 
>, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' 
cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; 
CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 41%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_nnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ 
= cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 41%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ntn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; 
TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; 
TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> 
>, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 41%] Built target cutlass_library_gemm_sm90_h64x128x16gemm_objs [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/all_sm90_s64x128x32gemm_e4m3_gemm_operations.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long 
int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, 
cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no 
user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > 
>; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; 
CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> 
>, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' 
cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = 
cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > 
>; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ 
[ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; 
ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, 
long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, 
float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, 
cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not 
initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, 
FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, 
cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, 
cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 42%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ 
[ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 42%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) 
[with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> 
>, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In 
instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const 
Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status 
cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 42%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, 
cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, 
cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const 
cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 44%] Built target cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/all_sm90_s64x128x32gemm_e4m3_e5m2_gemm_operations.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 44%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; 
ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long 
int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, 
cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 
44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, 
CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, 
cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, 
cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const 
cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 44%] Built target cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/all_sm90_s64x128x32gemm_e5m2_gemm_operations.cu.o [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 
'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > 
> >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building 
CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> 
> >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = 
cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = 
cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const 
cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = 
void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, 
cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: 
note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, 
cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = 
void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor 
does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long 
int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, 
cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, 
cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const 
cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, 
EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, 
cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 
3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, 
cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, 
cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, 
cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, 
StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, 
cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, 
cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, 
cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, 
cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, 
cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, 
ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long 
int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 
4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = 
cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and 
the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required 
from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 
'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 45%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does 
not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 45%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 45%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, 
SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, 
cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, 
cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; 
int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > 
> >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, 
void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, 
cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with 
ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' 
{aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, 
FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, 
cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 
3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = 
cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 
2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' 
{aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static 
constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required 
from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: 
required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 45%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> 
> > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 46%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; 
ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, 
cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, 
cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, 
ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > 
>, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, 
long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, 
cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int 
FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, 
cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; 
EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 
4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ 
= cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, 
cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no 
user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t 
CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: 
required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 47%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, 
cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: 
note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 47%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 47%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 47%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, 
ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const 
cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const 
Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status 
cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 
47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 47%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/all_sm90_s64x128x32gemm_e5m2_e4m3_gemm_operations.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, 
SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, 
cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, 
cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 47%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 47%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, 
cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > 
>, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, 
cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 47%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/all_sm90_s64x128x8gemm_tf32_gemm_operations.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, 
cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long 
int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, 
ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, 
cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, 
cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: 
note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int>, cutlass::tfloat32_t, cute::tuple, long int>, 
cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> 
> > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, 
SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma.cu.o [ 47%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 
'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = 
cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, 
cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int 
FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > 
>, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const 
cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, 
CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int>, cutlass::tfloat32_t, 
cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, 
cute::tuple, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, 
cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_cooperative_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, 
CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, 
cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> 
>, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' 
cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: 
required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> 
>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, 
long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, 
long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, 
cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, 
cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, 
cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, 
long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, 
cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_256x128x32_1x2x1_0_tnn_align4_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, 
FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) 
[with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_cooperative_epi_tma; typename 
std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, 
cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_256x128x32_1x2x1_0_nnn_align4_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o
[ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_256x128x32_1x2x1_0_ntn_align4_warpspecialized_cooperative_epi_nosmem.cu.o
[ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_ttn_align4_warpspecialized_pingpong.cu.o
[ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
[ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o
[ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ttn_align4_warpspecialized_cooperative.cu.o
[ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o
[ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ttn_align4_warpspecialized_pingpong.cu.o
[ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o
[ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o
[ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_256x128x32_1x2x1_0_ttn_align4_warpspecialized_cooperative.cu.o
[ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o
[ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_1x1x1_0_tnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o
[ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o
[ 49%] Building CUDA object
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_1x1x1_0_nnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_1x1x1_0_ntn_align2_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = 
cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long 
int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, 
cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_1x1x1_0_tnn_align1_cpasync_warpspecialized_epi_nosmem.cu.o [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_1x1x1_0_nnn_align1_cpasync_warpspecialized_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_1x1x1_0_ntn_align1_cpasync_warpspecialized_epi_nosmem.cu.o [ 50%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs [ 50%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/all_sm90_s64x128x8tf32gemm_gemm_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_1x1x1_0_ttn_align2_cpasync_warpspecialized.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = 
cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' 
{aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, 
FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, 
void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const 
cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_1x1x1_0_ttn_align1_cpasync_warpspecialized.cu.o [ 50%] Built target cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/all_sm90_s8_i64x128x32gemm_s8_gemm_operations.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: 
In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, 
void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with 
GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 
'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long 
int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, 
long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, 
float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma.cu.o [ 50%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = signed char; StrideC_ = cute::tuple, long int, long int>; ElementD_ = signed char; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, signed char, cute::tuple, long int, long int>, signed char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, signed char, cute::tuple, long int, long int>, signed char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, signed char, cute::tuple, long int, long int>, signed char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, signed char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, signed char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_nosmem.cu.o [ 50%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, 
cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with 
ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, 
cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > 
>&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = signed char; StrideC_ = cute::tuple, long int, long int>; ElementD_ = signed char; StrideD_ = 
cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, signed char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, signed char, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, signed char, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, signed char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, signed char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined 
constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > 
>, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, 
cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = signed char; StrideC_ = cute::tuple, long int, long int>; ElementD_ = signed char; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = 
cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, signed char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, signed char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, signed char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, signed char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, signed char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > 
>, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; 
^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cute::TiledMMA >, 
cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, 
cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, 
float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> 
> >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 50%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_256x128x32_1x2x1_0_tnn_align4_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> 
>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_256x128x32_1x2x1_0_nnn_align4_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_256x128x32_1x2x1_0_ntn_align4_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 
50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_ttn_align4_warpspecialized_pingpong.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, 
SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' 
{aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; 
CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 50%] Built target cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/all_sm90_s8_i64x128x32gemm_u8_gemm_operations.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ttn_align4_warpspecialized_cooperative.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, 
EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, 
void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, 
ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const 
cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 
'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ttn_align4_warpspecialized_pingpong.cu.o [ 50%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = signed char; StrideC_ = cute::tuple, long int, long int>; ElementD_ = unsigned char; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, signed char, cute::tuple, long int, long int>, unsigned char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, signed char, cute::tuple, long int, long int>, unsigned char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, signed char, cute::tuple, long int, long int>, unsigned char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, unsigned char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, unsigned char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_256x128x32_1x2x1_0_ttn_align4_warpspecialized_cooperative.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) 
[with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, 
cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_1x1x1_0_tnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, 
CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, 
cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, 
cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = signed char; StrideC_ = cute::tuple, long int, long int>; ElementD_ = unsigned char; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, unsigned char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, unsigned char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, 
long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, unsigned char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, unsigned char>, cute::Layout, 
cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, unsigned char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_1x1x1_0_nnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_1x1x1_0_ntn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = signed char; StrideC_ = cute::tuple, long int, long int>; ElementD_ = unsigned char; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, unsigned char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, unsigned char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, unsigned char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, unsigned char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, unsigned char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 51%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_1x1x1_0_tnn_align1_cpasync_warpspecialized_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_1x1x1_0_nnn_align1_cpasync_warpspecialized_epi_nosmem.cu.o [ 51%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_1x1x1_0_ntn_align1_cpasync_warpspecialized_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_1x1x1_0_ttn_align2_cpasync_warpspecialized.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_1x1x1_0_ttn_align1_cpasync_warpspecialized.cu.o [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Built target cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_objs.dir/generated/gemm/90/void_i64x128x32gemm_s8/all_sm90_void_i64x128x32gemm_s8_gemm_operations.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_objs.dir/generated/gemm/90/void_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_objs.dir/generated/gemm/90/void_i64x128x32gemm_u8/all_sm90_void_i64x128x32gemm_u8_gemm_operations.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_objs.dir/generated/gemm/90/void_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_objs.dir/generated/gemm/90/void_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_nosmem.cu.o [ 52%] Built target cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/all_sm90_void_s64x128x16gemm_bf16_gemm_operations.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, 
CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct 
cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_objs.dir/generated/gemm/90/void_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_objs.dir/generated/gemm/90/void_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, 
cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, 
cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_objs.dir/generated/gemm/90/void_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_objs.dir/generated/gemm/90/void_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, 
cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, 
unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned 
char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, 
cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, 
const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_objs.dir/generated/gemm/90/void_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, 
CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_objs.dir/generated/gemm/90/void_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; 
EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, 
cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, 
void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, 
cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_objs.dir/generated/gemm/90/void_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_objs.dir/generated/gemm/90/void_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, 
cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, 
cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, 
cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, 
cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 52%] Built target cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_objs [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/all_sm90_void_s64x128x16gemm_f16_gemm_operations.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, 
StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, 
void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_objs.dir/generated/gemm/90/void_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, 
CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > 
>, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > 
> >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 52%] Built target cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_objs [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3/all_sm90_void_s64x128x32gemm_e4m3_gemm_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, 
CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, 
cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, 
cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, 
cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, 
cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with 
ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 
0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, 
cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 
3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, 
long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long 
int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long 
int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; 
SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; 
^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: 
required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; 
cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In 
instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, 
cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, 
cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, 
CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, 
cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, 
cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, 
CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple 
>, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, 
cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int 
StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long 
int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long 
int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, 
void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no 
user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> 
>, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor 
does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_objs [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3_e5m2/all_sm90_void_s64x128x32gemm_e4m3_e5m2_gemm_operations.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long 
int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default 
constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; 
CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' 
cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = 
cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> 
>, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> 
>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, 
StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const 
cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: 
required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, 
CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, 
const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, 
CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, 
cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, 
cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, 
cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, 
cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, 
long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the 
implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; 
TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, 
StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const 
cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: 
required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2_objs [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2/all_sm90_void_s64x128x32gemm_e5m2_gemm_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, 
CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const 
cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, 
cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, 
cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, 
cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, 
cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> 
> >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = 
cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, 
cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, 
cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, 
cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor 
struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; 
CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; 
cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const 
cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, 
void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, 
void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, 
FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long 
int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, 
cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, 
cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, 
cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; 
bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, 
cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, 
cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > 
> >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Built target cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2_e4m3/all_sm90_void_s64x128x32gemm_e5m2_e4m3_gemm_operations.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 
'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation 
of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, 
void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with 
GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = 
cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, 
void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) 
[with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = 
cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, 
CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, 
cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, 
cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, 
cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, 
cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, 
CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, 
cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_objs [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/all_sm90_z1684gemm_gemm_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = 
cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 
1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = 
cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined 
constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_nnn_align1.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_cnn_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; 
StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, 
long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long 
int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque 
[16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = 
cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t 
= CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; 
cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 53%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_ncn_align1.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_ccn_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > 
>, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o 
[ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_ntn_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; 
cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_ctn_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: 
required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3_objs [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_cf32_cdgrad_optimized_cf32_objs.dir/generated/conv2d/50/cf32_cdgrad_optimized_cf32/all_sm50_cf32_cdgrad_optimized_cf32_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_nhn_align1.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 
'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) 
[with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = 
cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 54%] Built target cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_cf32_cfprop_optimized_cf32_objs.dir/generated/conv2d/50/cf32_cfprop_optimized_cf32/all_sm50_cf32_cfprop_optimized_cf32_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_cf32_cdgrad_optimized_cf32_objs.dir/generated/conv2d/50/cf32_cdgrad_optimized_cf32/cutlass_simt_cf32_cdgrad_optimized_cf32_128x64_8x2_nhwc_unity_stride_align1.cu.o [ 54%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm50_cf32_cfprop_optimized_cf32_objs.dir/generated/conv2d/50/cf32_cfprop_optimized_cf32/cutlass_simt_cf32_cfprop_optimized_cf32_128x64_8x2_nhwc_align1.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_chn_align1.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_tnn_align1.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_cf32_cdgrad_optimized_cf32_objs.dir/generated/conv2d/50/cf32_cdgrad_optimized_cf32/cutlass_simt_cf32_cdgrad_optimized_cf32_128x64_8x2_nhwc_align1.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_hnn_align1.cu.o [ 54%] Built target cutlass_library_conv2d_sm50_cf32_cfprop_optimized_cf32_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_cf32_cwgrad_optimized_cf32_objs.dir/generated/conv2d/50/cf32_cwgrad_optimized_cf32/all_sm50_cf32_cwgrad_optimized_cf32_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_tcn_align1.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_cf32_cwgrad_optimized_cf32_objs.dir/generated/conv2d/50/cf32_cwgrad_optimized_cf32/cutlass_simt_cf32_cwgrad_optimized_cf32_128x64_8x2_nhwc_align1.cu.o [ 54%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_hcn_align1.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_sdgrad_optimized_objs.dir/generated/conv2d/50/sdgrad_optimized/all_sm50_sdgrad_optimized_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_sdgrad_optimized_objs.dir/generated/conv2d/50/sdgrad_optimized/cutlass_simt_sdgrad_optimized_128x128_8x2_nhwc_unity_stride_align1.cu.o [ 54%] Built target cutlass_library_conv2d_sm50_cf32_cwgrad_optimized_cf32_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_sfprop_optimized_objs.dir/generated/conv2d/50/sfprop_optimized/all_sm50_sfprop_optimized_conv2d_operations.cu.o [ 54%] Built target cutlass_library_conv2d_sm50_cf32_cdgrad_optimized_cf32_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_swgrad_optimized_objs.dir/generated/conv2d/50/swgrad_optimized/all_sm50_swgrad_optimized_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_ttn_align1.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_sfprop_optimized_objs.dir/generated/conv2d/50/sfprop_optimized/cutlass_simt_sfprop_optimized_128x128_8x2_nhwc_align1.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_swgrad_optimized_objs.dir/generated/conv2d/50/swgrad_optimized/cutlass_simt_swgrad_optimized_128x128_8x2_nhwc_align1.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_sdgrad_optimized_objs.dir/generated/conv2d/50/sdgrad_optimized/cutlass_simt_sdgrad_optimized_128x128_8x2_nhwc_align1.cu.o [ 54%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_htn_align1.cu.o [ 54%] Built target cutlass_library_conv2d_sm50_sfprop_optimized_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm60_hfprop_optimized_objs.dir/generated/conv2d/60/hfprop_optimized/all_sm60_hfprop_optimized_conv2d_operations.cu.o [ 54%] Built target cutlass_library_conv2d_sm50_swgrad_optimized_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_f16_s884dgrad_optimized_f16_objs.dir/generated/conv2d/70/f16_s884dgrad_optimized_f16/all_sm70_f16_s884dgrad_optimized_f16_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm60_hfprop_optimized_objs.dir/generated/conv2d/60/hfprop_optimized/cutlass_simt_hfprop_optimized_64x32x9_1x8x8x32_3_filter3x3_nhwc_depthwise_align8.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_f16_s884dgrad_optimized_f16_objs.dir/generated/conv2d/70/f16_s884dgrad_optimized_f16/cutlass_tensorop_f16_s884dgrad_optimized_f16_256x128_32x2_nhwc_unity_stride_align8.cu.o [ 54%] Built target cutlass_library_conv2d_sm50_sdgrad_optimized_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_f16_s884fprop_optimized_f16_objs.dir/generated/conv2d/70/f16_s884fprop_optimized_f16/all_sm70_f16_s884fprop_optimized_f16_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_thn_align1.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_f16_s884fprop_optimized_f16_objs.dir/generated/conv2d/70/f16_s884fprop_optimized_f16/cutlass_tensorop_f16_s884fprop_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 54%] Built 
target cutlass_library_conv2d_sm60_hfprop_optimized_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_f16_s884wgrad_optimized_f16_objs.dir/generated/conv2d/70/f16_s884wgrad_optimized_f16/all_sm70_f16_s884wgrad_optimized_f16_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_f16_s884dgrad_optimized_f16_objs.dir/generated/conv2d/70/f16_s884dgrad_optimized_f16/cutlass_tensorop_f16_s884dgrad_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_f16_s884wgrad_optimized_f16_objs.dir/generated/conv2d/70/f16_s884wgrad_optimized_f16/cutlass_tensorop_f16_s884wgrad_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 54%] Built target cutlass_library_conv2d_sm70_f16_s884fprop_optimized_f16_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_h884dgrad_optimized_objs.dir/generated/conv2d/70/h884dgrad_optimized/all_sm70_h884dgrad_optimized_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_hhn_align1.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_h884dgrad_optimized_objs.dir/generated/conv2d/70/h884dgrad_optimized/cutlass_tensorop_h884dgrad_optimized_256x128_32x2_nhwc_unity_stride_align8.cu.o [ 54%] Built target cutlass_library_conv2d_sm70_f16_s884dgrad_optimized_f16_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_h884fprop_optimized_objs.dir/generated/conv2d/70/h884fprop_optimized/all_sm70_h884fprop_optimized_conv2d_operations.cu.o [ 54%] Built target cutlass_library_conv2d_sm70_f16_s884wgrad_optimized_f16_objs [ 54%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm70_h884wgrad_optimized_objs.dir/generated/conv2d/70/h884wgrad_optimized/all_sm70_h884wgrad_optimized_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_h884fprop_optimized_objs.dir/generated/conv2d/70/h884fprop_optimized/cutlass_tensorop_h884fprop_optimized_256x128_32x2_nhwc_align8.cu.o [ 54%] Built target cutlass_library_gemm_sm90_z1684gemm_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_s884dgrad_optimized_f16_objs.dir/generated/conv2d/70/s884dgrad_optimized_f16/all_sm70_s884dgrad_optimized_f16_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_h884wgrad_optimized_objs.dir/generated/conv2d/70/h884wgrad_optimized/cutlass_tensorop_h884wgrad_optimized_256x128_32x2_nhwc_align8.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_h884dgrad_optimized_objs.dir/generated/conv2d/70/h884dgrad_optimized/cutlass_tensorop_h884dgrad_optimized_256x128_32x2_nhwc_align8.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_s884dgrad_optimized_f16_objs.dir/generated/conv2d/70/s884dgrad_optimized_f16/cutlass_tensorop_s884dgrad_optimized_f16_256x128_32x2_nhwc_unity_stride_align8.cu.o [ 54%] Built target cutlass_library_conv2d_sm70_h884fprop_optimized_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_s884fprop_optimized_f16_objs.dir/generated/conv2d/70/s884fprop_optimized_f16/all_sm70_s884fprop_optimized_f16_conv2d_operations.cu.o [ 54%] Built target cutlass_library_conv2d_sm70_h884wgrad_optimized_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_s884wgrad_optimized_f16_objs.dir/generated/conv2d/70/s884wgrad_optimized_f16/all_sm70_s884wgrad_optimized_f16_conv2d_operations.cu.o [ 54%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm70_s884fprop_optimized_f16_objs.dir/generated/conv2d/70/s884fprop_optimized_f16/cutlass_tensorop_s884fprop_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_s884wgrad_optimized_f16_objs.dir/generated/conv2d/70/s884wgrad_optimized_f16/cutlass_tensorop_s884wgrad_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_s884dgrad_optimized_f16_objs.dir/generated/conv2d/70/s884dgrad_optimized_f16/cutlass_tensorop_s884dgrad_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 54%] Built target cutlass_library_conv2d_sm70_h884dgrad_optimized_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_cf32_cdgrad_optimized_cf32_objs.dir/generated/conv2d/75/cf32_cdgrad_optimized_cf32/all_sm75_cf32_cdgrad_optimized_cf32_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_cf32_cdgrad_optimized_cf32_objs.dir/generated/conv2d/75/cf32_cdgrad_optimized_cf32/cutlass_simt_cf32_cdgrad_optimized_cf32_128x128_8x5_nhwc_unity_stride_align1.cu.o [ 55%] Built target cutlass_library_conv2d_sm70_s884fprop_optimized_f16_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_cf32_cwgrad_optimized_cf32_objs.dir/generated/conv2d/75/cf32_cwgrad_optimized_cf32/all_sm75_cf32_cwgrad_optimized_cf32_conv2d_operations.cu.o [ 55%] Built target cutlass_library_conv2d_sm70_s884wgrad_optimized_f16_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688dgrad_optimized_f16_objs.dir/generated/conv2d/75/f16_s1688dgrad_optimized_f16/all_sm75_f16_s1688dgrad_optimized_f16_conv2d_operations.cu.o [ 55%] Built target cutlass_library_conv2d_sm70_s884dgrad_optimized_f16_objs [ 55%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688fprop_few_channels_f16_objs.dir/generated/conv2d/75/f16_s1688fprop_few_channels_f16/all_sm75_f16_s1688fprop_few_channels_f16_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_cf32_cwgrad_optimized_cf32_objs.dir/generated/conv2d/75/cf32_cwgrad_optimized_cf32/cutlass_simt_cf32_cwgrad_optimized_cf32_128x128_8x5_nhwc_align1.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688dgrad_optimized_f16_objs.dir/generated/conv2d/75/f16_s1688dgrad_optimized_f16/cutlass_tensorop_f16_s1688dgrad_optimized_f16_256x128_32x2_nhwc_unity_stride_align8.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688fprop_few_channels_f16_objs.dir/generated/conv2d/75/f16_s1688fprop_few_channels_f16/cutlass_tensorop_f16_s1688fprop_few_channels_f16_128x64_32x2_nhwc_align1.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_cf32_cdgrad_optimized_cf32_objs.dir/generated/conv2d/75/cf32_cdgrad_optimized_cf32/cutlass_simt_cf32_cdgrad_optimized_cf32_128x128_8x5_nhwc_align1.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688dgrad_optimized_f16_objs.dir/generated/conv2d/75/f16_s1688dgrad_optimized_f16/cutlass_tensorop_f16_s1688dgrad_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_cf32_cwgrad_optimized_cf32_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688fprop_fixed_channels_f16_objs.dir/generated/conv2d/75/f16_s1688fprop_fixed_channels_f16/all_sm75_f16_s1688fprop_fixed_channels_f16_conv2d_operations.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_f16_s1688fprop_few_channels_f16_objs [ 55%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688fprop_optimized_f16_objs.dir/generated/conv2d/75/f16_s1688fprop_optimized_f16/all_sm75_f16_s1688fprop_optimized_f16_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688fprop_fixed_channels_f16_objs.dir/generated/conv2d/75/f16_s1688fprop_fixed_channels_f16/cutlass_tensorop_f16_s1688fprop_fixed_channels_f16_128x64_32x2_nhwc_align4.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688fprop_optimized_f16_objs.dir/generated/conv2d/75/f16_s1688fprop_optimized_f16/cutlass_tensorop_f16_s1688fprop_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_f16_s1688dgrad_optimized_f16_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688wgrad_optimized_f16_objs.dir/generated/conv2d/75/f16_s1688wgrad_optimized_f16/all_sm75_f16_s1688wgrad_optimized_f16_conv2d_operations.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_cf32_cdgrad_optimized_cf32_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688dgrad_optimized_objs.dir/generated/conv2d/75/h1688dgrad_optimized/all_sm75_h1688dgrad_optimized_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688wgrad_optimized_f16_objs.dir/generated/conv2d/75/f16_s1688wgrad_optimized_f16/cutlass_tensorop_f16_s1688wgrad_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688dgrad_optimized_objs.dir/generated/conv2d/75/h1688dgrad_optimized/cutlass_tensorop_h1688dgrad_optimized_256x128_32x2_nhwc_unity_stride_align8.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_f16_s1688fprop_fixed_channels_f16_objs [ 55%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688fprop_few_channels_objs.dir/generated/conv2d/75/h1688fprop_few_channels/all_sm75_h1688fprop_few_channels_conv2d_operations.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_f16_s1688fprop_optimized_f16_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688fprop_fixed_channels_objs.dir/generated/conv2d/75/h1688fprop_fixed_channels/all_sm75_h1688fprop_fixed_channels_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688fprop_few_channels_objs.dir/generated/conv2d/75/h1688fprop_few_channels/cutlass_tensorop_h1688fprop_few_channels_128x64_32x2_nhwc_align1.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688fprop_fixed_channels_objs.dir/generated/conv2d/75/h1688fprop_fixed_channels/cutlass_tensorop_h1688fprop_fixed_channels_128x64_32x2_nhwc_align4.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_f16_s1688wgrad_optimized_f16_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688fprop_optimized_objs.dir/generated/conv2d/75/h1688fprop_optimized/all_sm75_h1688fprop_optimized_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688dgrad_optimized_objs.dir/generated/conv2d/75/h1688dgrad_optimized/cutlass_tensorop_h1688dgrad_optimized_256x128_32x2_nhwc_align8.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688fprop_optimized_objs.dir/generated/conv2d/75/h1688fprop_optimized/cutlass_tensorop_h1688fprop_optimized_256x128_32x2_nhwc_align8.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_h1688fprop_fixed_channels_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688wgrad_optimized_objs.dir/generated/conv2d/75/h1688wgrad_optimized/all_sm75_h1688wgrad_optimized_conv2d_operations.cu.o [ 55%] Built target 
cutlass_library_conv2d_sm75_h1688fprop_few_channels_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_i8816fprop_optimized_s8_objs.dir/generated/conv2d/75/i8816fprop_optimized_s8/all_sm75_i8816fprop_optimized_s8_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688wgrad_optimized_objs.dir/generated/conv2d/75/h1688wgrad_optimized/cutlass_tensorop_h1688wgrad_optimized_256x128_32x2_nhwc_align8.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_i8816fprop_optimized_s8_objs.dir/generated/conv2d/75/i8816fprop_optimized_s8/cutlass_tensorop_i8816fprop_optimized_s8_256x128_64x2_nhwc_align16.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_h1688dgrad_optimized_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_i8816fprop_optimized_u8_objs.dir/generated/conv2d/75/i8816fprop_optimized_u8/all_sm75_i8816fprop_optimized_u8_conv2d_operations.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_h1688fprop_optimized_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_i8832fprop_optimized_s4_objs.dir/generated/conv2d/75/i8832fprop_optimized_s4/all_sm75_i8832fprop_optimized_s4_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_i8816fprop_optimized_u8_objs.dir/generated/conv2d/75/i8816fprop_optimized_u8/cutlass_tensorop_i8816fprop_optimized_u8_256x128_64x2_nhwc_align16.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_i8832fprop_optimized_s4_objs.dir/generated/conv2d/75/i8832fprop_optimized_s4/cutlass_tensorop_i8832fprop_optimized_s4_256x128_128x2_nhwc_align32.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_h1688wgrad_optimized_objs [ 55%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_i8832fprop_optimized_u4_objs.dir/generated/conv2d/75/i8832fprop_optimized_u4/all_sm75_i8832fprop_optimized_u4_conv2d_operations.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_i8816fprop_optimized_s8_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688dgrad_optimized_f16_objs.dir/generated/conv2d/75/s1688dgrad_optimized_f16/all_sm75_s1688dgrad_optimized_f16_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_i8832fprop_optimized_u4_objs.dir/generated/conv2d/75/i8832fprop_optimized_u4/cutlass_tensorop_i8832fprop_optimized_u4_256x128_128x2_nhwc_align32.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688dgrad_optimized_f16_objs.dir/generated/conv2d/75/s1688dgrad_optimized_f16/cutlass_tensorop_s1688dgrad_optimized_f16_256x128_32x2_nhwc_unity_stride_align8.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_i8816fprop_optimized_u8_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688fprop_few_channels_f16_objs.dir/generated/conv2d/75/s1688fprop_few_channels_f16/all_sm75_s1688fprop_few_channels_f16_conv2d_operations.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_i8832fprop_optimized_s4_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688fprop_fixed_channels_f16_objs.dir/generated/conv2d/75/s1688fprop_fixed_channels_f16/all_sm75_s1688fprop_fixed_channels_f16_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688fprop_few_channels_f16_objs.dir/generated/conv2d/75/s1688fprop_few_channels_f16/cutlass_tensorop_s1688fprop_few_channels_f16_128x64_32x2_nhwc_align1.cu.o [ 55%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688fprop_fixed_channels_f16_objs.dir/generated/conv2d/75/s1688fprop_fixed_channels_f16/cutlass_tensorop_s1688fprop_fixed_channels_f16_128x64_32x2_nhwc_align4.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_i8832fprop_optimized_u4_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688fprop_optimized_f16_objs.dir/generated/conv2d/75/s1688fprop_optimized_f16/all_sm75_s1688fprop_optimized_f16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688dgrad_optimized_f16_objs.dir/generated/conv2d/75/s1688dgrad_optimized_f16/cutlass_tensorop_s1688dgrad_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688fprop_optimized_f16_objs.dir/generated/conv2d/75/s1688fprop_optimized_f16/cutlass_tensorop_s1688fprop_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_s1688fprop_fixed_channels_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688wgrad_optimized_f16_objs.dir/generated/conv2d/75/s1688wgrad_optimized_f16/all_sm75_s1688wgrad_optimized_f16_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_s1688fprop_few_channels_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s4_i8832fprop_optimized_s4_objs.dir/generated/conv2d/75/s4_i8832fprop_optimized_s4/all_sm75_s4_i8832fprop_optimized_s4_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688wgrad_optimized_f16_objs.dir/generated/conv2d/75/s1688wgrad_optimized_f16/cutlass_tensorop_s1688wgrad_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s4_i8832fprop_optimized_s4_objs.dir/generated/conv2d/75/s4_i8832fprop_optimized_s4/cutlass_tensorop_s4_i8832fprop_optimized_s4_256x128_128x2_nhwc_align32.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_s1688dgrad_optimized_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s8_i8816fprop_few_channels_s8_objs.dir/generated/conv2d/75/s8_i8816fprop_few_channels_s8/all_sm75_s8_i8816fprop_few_channels_s8_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_s1688fprop_optimized_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s8_i8816fprop_fixed_channels_s8_objs.dir/generated/conv2d/75/s8_i8816fprop_fixed_channels_s8/all_sm75_s8_i8816fprop_fixed_channels_s8_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s8_i8816fprop_few_channels_s8_objs.dir/generated/conv2d/75/s8_i8816fprop_few_channels_s8/cutlass_tensorop_s8_i8816fprop_few_channels_s8_256x128_64x2_nhwc_align16.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s8_i8816fprop_fixed_channels_s8_objs.dir/generated/conv2d/75/s8_i8816fprop_fixed_channels_s8/cutlass_tensorop_s8_i8816fprop_fixed_channels_s8_256x128_64x2_nhwc_align16.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_s1688wgrad_optimized_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s8_i8816fprop_optimized_s8_objs.dir/generated/conv2d/75/s8_i8816fprop_optimized_s8/all_sm75_s8_i8816fprop_optimized_s8_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s4_i8832fprop_optimized_s4_objs.dir/generated/conv2d/75/s4_i8832fprop_optimized_s4/cutlass_tensorop_s4_i8832fprop_optimized_s4_256x128_128x2_nc64hw64_align32.cu.o [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s8_i8816fprop_optimized_s8_objs.dir/generated/conv2d/75/s8_i8816fprop_optimized_s8/cutlass_tensorop_s8_i8816fprop_optimized_s8_256x128_64x2_nhwc_align16.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_s8_i8816fprop_fixed_channels_s8_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u4_i8832fprop_optimized_u4_objs.dir/generated/conv2d/75/u4_i8832fprop_optimized_u4/all_sm75_u4_i8832fprop_optimized_u4_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_s8_i8816fprop_few_channels_s8_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u8_i8816fprop_few_channels_u8_objs.dir/generated/conv2d/75/u8_i8816fprop_few_channels_u8/all_sm75_u8_i8816fprop_few_channels_u8_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u4_i8832fprop_optimized_u4_objs.dir/generated/conv2d/75/u4_i8832fprop_optimized_u4/cutlass_tensorop_u4_i8832fprop_optimized_u4_256x128_128x2_nhwc_align32.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u8_i8816fprop_few_channels_u8_objs.dir/generated/conv2d/75/u8_i8816fprop_few_channels_u8/cutlass_tensorop_u8_i8816fprop_few_channels_u8_256x128_64x2_nhwc_align16.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_s4_i8832fprop_optimized_s4_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u8_i8816fprop_fixed_channels_u8_objs.dir/generated/conv2d/75/u8_i8816fprop_fixed_channels_u8/all_sm75_u8_i8816fprop_fixed_channels_u8_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s8_i8816fprop_optimized_s8_objs.dir/generated/conv2d/75/s8_i8816fprop_optimized_s8/cutlass_tensorop_s8_i8816fprop_optimized_s8_256x128_64x2_nc32hw32_align16.cu.o [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u8_i8816fprop_fixed_channels_u8_objs.dir/generated/conv2d/75/u8_i8816fprop_fixed_channels_u8/cutlass_tensorop_u8_i8816fprop_fixed_channels_u8_256x128_64x2_nhwc_align16.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u4_i8832fprop_optimized_u4_objs.dir/generated/conv2d/75/u4_i8832fprop_optimized_u4/cutlass_tensorop_u4_i8832fprop_optimized_u4_256x128_128x2_nc64hw64_align32.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_few_channels_u8_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u8_i8816fprop_optimized_u8_objs.dir/generated/conv2d/75/u8_i8816fprop_optimized_u8/all_sm75_u8_i8816fprop_optimized_u8_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_s8_i8816fprop_optimized_s8_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816dgrad_optimized_bf16_objs.dir/generated/conv2d/80/bf16_s16816dgrad_optimized_bf16/all_sm80_bf16_s16816dgrad_optimized_bf16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u8_i8816fprop_optimized_u8_objs.dir/generated/conv2d/75/u8_i8816fprop_optimized_u8/cutlass_tensorop_u8_i8816fprop_optimized_u8_256x128_64x2_nhwc_align16.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_fixed_channels_u8_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16_objs.dir/generated/conv2d/80/bf16_s16816fprop_fixed_channels_bf16/all_sm80_bf16_s16816fprop_fixed_channels_bf16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816dgrad_optimized_bf16_objs.dir/generated/conv2d/80/bf16_s16816dgrad_optimized_bf16/cutlass_tensorop_bf16_s16816dgrad_optimized_bf16_256x128_32x3_nhwc_unity_stride_align8.cu.o [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16_objs.dir/generated/conv2d/80/bf16_s16816fprop_fixed_channels_bf16/cutlass_tensorop_bf16_s16816fprop_fixed_channels_bf16_256x128_32x3_nhwc_align4.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_u4_i8832fprop_optimized_u4_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816fprop_optimized_bf16_objs.dir/generated/conv2d/80/bf16_s16816fprop_optimized_bf16/all_sm80_bf16_s16816fprop_optimized_bf16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816fprop_optimized_bf16_objs.dir/generated/conv2d/80/bf16_s16816fprop_optimized_bf16/cutlass_tensorop_bf16_s16816fprop_optimized_bf16_256x128_32x3_nhwc_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u8_i8816fprop_optimized_u8_objs.dir/generated/conv2d/75/u8_i8816fprop_optimized_u8/cutlass_tensorop_u8_i8816fprop_optimized_u8_256x128_64x2_nc32hw32_align16.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816dgrad_optimized_bf16_objs.dir/generated/conv2d/80/bf16_s16816dgrad_optimized_bf16/cutlass_tensorop_bf16_s16816dgrad_optimized_bf16_256x128_32x3_nhwc_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816wgrad_optimized_bf16_objs.dir/generated/conv2d/80/bf16_s16816wgrad_optimized_bf16/all_sm80_bf16_s16816wgrad_optimized_bf16_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_optimized_u8_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816dgrad_optimized_f16_objs.dir/generated/conv2d/80/f16_s16816dgrad_optimized_f16/all_sm80_f16_s16816dgrad_optimized_f16_conv2d_operations.cu.o [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816fprop_optimized_bf16_objs.dir/generated/conv2d/80/bf16_s16816fprop_optimized_bf16/cutlass_tensorop_bf16_s16816fprop_optimized_bf16_256x128_32x3_nhwc_single_group_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816wgrad_optimized_bf16_objs.dir/generated/conv2d/80/bf16_s16816wgrad_optimized_bf16/cutlass_tensorop_bf16_s16816wgrad_optimized_bf16_256x128_32x3_nhwc_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816dgrad_optimized_f16_objs.dir/generated/conv2d/80/f16_s16816dgrad_optimized_f16/cutlass_tensorop_f16_s16816dgrad_optimized_f16_256x128_32x3_nhwc_unity_stride_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_bf16_s16816dgrad_optimized_bf16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816fprop_fixed_channels_f16_objs.dir/generated/conv2d/80/f16_s16816fprop_fixed_channels_f16/all_sm80_f16_s16816fprop_fixed_channels_f16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816fprop_fixed_channels_f16_objs.dir/generated/conv2d/80/f16_s16816fprop_fixed_channels_f16/cutlass_tensorop_f16_s16816fprop_fixed_channels_f16_256x128_32x3_nhwc_align4.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_bf16_s16816fprop_optimized_bf16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816fprop_optimized_f16_objs.dir/generated/conv2d/80/f16_s16816fprop_optimized_f16/all_sm80_f16_s16816fprop_optimized_f16_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_bf16_s16816wgrad_optimized_bf16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816wgrad_optimized_f16_objs.dir/generated/conv2d/80/f16_s16816wgrad_optimized_f16/all_sm80_f16_s16816wgrad_optimized_f16_conv2d_operations.cu.o [ 56%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816dgrad_optimized_f16_objs.dir/generated/conv2d/80/f16_s16816dgrad_optimized_f16/cutlass_tensorop_f16_s16816dgrad_optimized_f16_256x128_32x3_nhwc_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816fprop_optimized_f16_objs.dir/generated/conv2d/80/f16_s16816fprop_optimized_f16/cutlass_tensorop_f16_s16816fprop_optimized_f16_256x128_32x3_nhwc_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816wgrad_optimized_f16_objs.dir/generated/conv2d/80/f16_s16816wgrad_optimized_f16/cutlass_tensorop_f16_s16816wgrad_optimized_f16_256x128_32x3_nhwc_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_f16_s16816fprop_fixed_channels_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816dgrad_optimized_objs.dir/generated/conv2d/80/h16816dgrad_optimized/all_sm80_h16816dgrad_optimized_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816dgrad_optimized_objs.dir/generated/conv2d/80/h16816dgrad_optimized/cutlass_tensorop_h16816dgrad_optimized_256x128_32x3_nhwc_unity_stride_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816fprop_optimized_f16_objs.dir/generated/conv2d/80/f16_s16816fprop_optimized_f16/cutlass_tensorop_f16_s16816fprop_optimized_f16_256x128_32x3_nhwc_single_group_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_f16_s16816dgrad_optimized_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816fprop_fixed_channels_objs.dir/generated/conv2d/80/h16816fprop_fixed_channels/all_sm80_h16816fprop_fixed_channels_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_f16_s16816wgrad_optimized_f16_objs [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816fprop_optimized_objs.dir/generated/conv2d/80/h16816fprop_optimized/all_sm80_h16816fprop_optimized_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816fprop_fixed_channels_objs.dir/generated/conv2d/80/h16816fprop_fixed_channels/cutlass_tensorop_h16816fprop_fixed_channels_256x128_32x3_nhwc_align4.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816fprop_optimized_objs.dir/generated/conv2d/80/h16816fprop_optimized/cutlass_tensorop_h16816fprop_optimized_256x128_32x3_nhwc_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816dgrad_optimized_objs.dir/generated/conv2d/80/h16816dgrad_optimized/cutlass_tensorop_h16816dgrad_optimized_256x128_32x3_nhwc_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_f16_s16816fprop_optimized_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816wgrad_optimized_objs.dir/generated/conv2d/80/h16816wgrad_optimized/all_sm80_h16816wgrad_optimized_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816wgrad_optimized_objs.dir/generated/conv2d/80/h16816wgrad_optimized/cutlass_tensorop_h16816wgrad_optimized_256x128_32x3_nhwc_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_h16816fprop_fixed_channels_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16832fprop_optimized_s8_objs.dir/generated/conv2d/80/i16832fprop_optimized_s8/all_sm80_i16832fprop_optimized_s8_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816fprop_optimized_objs.dir/generated/conv2d/80/h16816fprop_optimized/cutlass_tensorop_h16816fprop_optimized_256x128_32x3_nhwc_single_group_align8.cu.o [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16832fprop_optimized_s8_objs.dir/generated/conv2d/80/i16832fprop_optimized_s8/cutlass_tensorop_i16832fprop_optimized_s8_256x128_64x3_nhwc_align16.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_h16816dgrad_optimized_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16832fprop_optimized_u8_objs.dir/generated/conv2d/80/i16832fprop_optimized_u8/all_sm80_i16832fprop_optimized_u8_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_h16816wgrad_optimized_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16864fprop_optimized_s4_objs.dir/generated/conv2d/80/i16864fprop_optimized_s4/all_sm80_i16864fprop_optimized_s4_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16832fprop_optimized_u8_objs.dir/generated/conv2d/80/i16832fprop_optimized_u8/cutlass_tensorop_i16832fprop_optimized_u8_256x128_64x3_nhwc_align16.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_h16816fprop_optimized_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16864fprop_optimized_u4_objs.dir/generated/conv2d/80/i16864fprop_optimized_u4/all_sm80_i16864fprop_optimized_u4_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16864fprop_optimized_s4_objs.dir/generated/conv2d/80/i16864fprop_optimized_s4/cutlass_tensorop_i16864fprop_optimized_s4_256x128_128x3_nhwc_align32.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16864fprop_optimized_u4_objs.dir/generated/conv2d/80/i16864fprop_optimized_u4/cutlass_tensorop_i16864fprop_optimized_u4_256x128_128x3_nhwc_align32.cu.o [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16832fprop_optimized_s8_objs.dir/generated/conv2d/80/i16832fprop_optimized_s8/cutlass_tensorop_i16832fprop_optimized_s8_256x128_64x3_nhwc_single_group_align16.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16832fprop_optimized_u8_objs.dir/generated/conv2d/80/i16832fprop_optimized_u8/cutlass_tensorop_i16832fprop_optimized_u8_256x128_64x3_nhwc_single_group_align16.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16864fprop_optimized_s4_objs.dir/generated/conv2d/80/i16864fprop_optimized_s4/cutlass_tensorop_i16864fprop_optimized_s4_256x128_128x3_nhwc_single_group_align32.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16864fprop_optimized_u4_objs.dir/generated/conv2d/80/i16864fprop_optimized_u4/cutlass_tensorop_i16864fprop_optimized_u4_256x128_128x3_nhwc_single_group_align32.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_i16832fprop_optimized_s8_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816dgrad_optimized_bf16_objs.dir/generated/conv2d/80/s16816dgrad_optimized_bf16/all_sm80_s16816dgrad_optimized_bf16_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_i16832fprop_optimized_u8_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816dgrad_optimized_f16_objs.dir/generated/conv2d/80/s16816dgrad_optimized_f16/all_sm80_s16816dgrad_optimized_f16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816dgrad_optimized_bf16_objs.dir/generated/conv2d/80/s16816dgrad_optimized_bf16/cutlass_tensorop_s16816dgrad_optimized_bf16_256x128_32x3_nhwc_unity_stride_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_i16864fprop_optimized_s4_objs [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_bf16_objs.dir/generated/conv2d/80/s16816fprop_fixed_channels_bf16/all_sm80_s16816fprop_fixed_channels_bf16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816dgrad_optimized_f16_objs.dir/generated/conv2d/80/s16816dgrad_optimized_f16/cutlass_tensorop_s16816dgrad_optimized_f16_256x128_32x3_nhwc_unity_stride_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_i16864fprop_optimized_u4_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_f16_objs.dir/generated/conv2d/80/s16816fprop_fixed_channels_f16/all_sm80_s16816fprop_fixed_channels_f16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_bf16_objs.dir/generated/conv2d/80/s16816fprop_fixed_channels_bf16/cutlass_tensorop_s16816fprop_fixed_channels_bf16_256x128_32x3_nhwc_align4.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_f16_objs.dir/generated/conv2d/80/s16816fprop_fixed_channels_f16/cutlass_tensorop_s16816fprop_fixed_channels_f16_256x128_32x3_nhwc_align4.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816dgrad_optimized_bf16_objs.dir/generated/conv2d/80/s16816dgrad_optimized_bf16/cutlass_tensorop_s16816dgrad_optimized_bf16_256x128_32x3_nhwc_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816dgrad_optimized_f16_objs.dir/generated/conv2d/80/s16816dgrad_optimized_f16/cutlass_tensorop_s16816dgrad_optimized_f16_256x128_32x3_nhwc_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_bf16_objs [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_optimized_bf16_objs.dir/generated/conv2d/80/s16816fprop_optimized_bf16/all_sm80_s16816fprop_optimized_bf16_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_optimized_f16_objs.dir/generated/conv2d/80/s16816fprop_optimized_f16/all_sm80_s16816fprop_optimized_f16_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s16816dgrad_optimized_bf16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816wgrad_optimized_bf16_objs.dir/generated/conv2d/80/s16816wgrad_optimized_bf16/all_sm80_s16816wgrad_optimized_bf16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_optimized_bf16_objs.dir/generated/conv2d/80/s16816fprop_optimized_bf16/cutlass_tensorop_s16816fprop_optimized_bf16_256x128_32x3_nhwc_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_optimized_f16_objs.dir/generated/conv2d/80/s16816fprop_optimized_f16/cutlass_tensorop_s16816fprop_optimized_f16_256x128_32x3_nhwc_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s16816dgrad_optimized_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816wgrad_optimized_f16_objs.dir/generated/conv2d/80/s16816wgrad_optimized_f16/all_sm80_s16816wgrad_optimized_f16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816wgrad_optimized_bf16_objs.dir/generated/conv2d/80/s16816wgrad_optimized_bf16/cutlass_tensorop_s16816wgrad_optimized_bf16_256x128_32x3_nhwc_align8.cu.o [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816wgrad_optimized_f16_objs.dir/generated/conv2d/80/s16816wgrad_optimized_f16/cutlass_tensorop_s16816wgrad_optimized_f16_256x128_32x3_nhwc_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_optimized_bf16_objs.dir/generated/conv2d/80/s16816fprop_optimized_bf16/cutlass_tensorop_s16816fprop_optimized_bf16_256x128_32x3_nhwc_single_group_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_optimized_f16_objs.dir/generated/conv2d/80/s16816fprop_optimized_f16/cutlass_tensorop_s16816fprop_optimized_f16_256x128_32x3_nhwc_single_group_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s16816wgrad_optimized_bf16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688bf16dgrad_optimized_objs.dir/generated/conv2d/80/s1688bf16dgrad_optimized/all_sm80_s1688bf16dgrad_optimized_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s16816wgrad_optimized_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688bf16fprop_optimized_objs.dir/generated/conv2d/80/s1688bf16fprop_optimized/all_sm80_s1688bf16fprop_optimized_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688bf16dgrad_optimized_objs.dir/generated/conv2d/80/s1688bf16dgrad_optimized/cutlass_tensorop_s1688bf16dgrad_optimized_256x128_16x3_nhwc_unity_stride_align4.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s16816fprop_optimized_bf16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688bf16wgrad_optimized_objs.dir/generated/conv2d/80/s1688bf16wgrad_optimized/all_sm80_s1688bf16wgrad_optimized_conv2d_operations.cu.o [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688bf16fprop_optimized_objs.dir/generated/conv2d/80/s1688bf16fprop_optimized/cutlass_tensorop_s1688bf16fprop_optimized_256x128_16x3_nhwc_align4.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s16816fprop_optimized_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688dgrad_optimized_objs.dir/generated/conv2d/80/s1688dgrad_optimized/all_sm80_s1688dgrad_optimized_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688bf16wgrad_optimized_objs.dir/generated/conv2d/80/s1688bf16wgrad_optimized/cutlass_tensorop_s1688bf16wgrad_optimized_256x128_16x3_nhwc_align4.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688dgrad_optimized_objs.dir/generated/conv2d/80/s1688dgrad_optimized/cutlass_tensorop_s1688dgrad_optimized_128x128_16x4_nhwc_unity_stride_align4.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688bf16dgrad_optimized_objs.dir/generated/conv2d/80/s1688bf16dgrad_optimized/cutlass_tensorop_s1688bf16dgrad_optimized_256x128_16x3_nhwc_align4.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688bf16fprop_optimized_objs.dir/generated/conv2d/80/s1688bf16fprop_optimized/cutlass_tensorop_s1688bf16fprop_optimized_256x128_16x3_nhwc_single_group_align4.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s1688bf16wgrad_optimized_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688dgrad_optimized_tf32_objs.dir/generated/conv2d/80/s1688dgrad_optimized_tf32/all_sm80_s1688dgrad_optimized_tf32_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688dgrad_optimized_objs.dir/generated/conv2d/80/s1688dgrad_optimized/cutlass_tensorop_s1688dgrad_optimized_128x128_16x4_nhwc_align4.cu.o [ 57%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688dgrad_optimized_tf32_objs.dir/generated/conv2d/80/s1688dgrad_optimized_tf32/cutlass_tensorop_s1688dgrad_optimized_tf32_256x128_16x3_nhwc_unity_stride_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688bf16dgrad_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688f16dgrad_optimized_objs.dir/generated/conv2d/80/s1688f16dgrad_optimized/all_sm80_s1688f16dgrad_optimized_conv2d_operations.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688bf16fprop_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688f16fprop_optimized_objs.dir/generated/conv2d/80/s1688f16fprop_optimized/all_sm80_s1688f16fprop_optimized_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688f16dgrad_optimized_objs.dir/generated/conv2d/80/s1688f16dgrad_optimized/cutlass_tensorop_s1688f16dgrad_optimized_256x128_16x3_nhwc_unity_stride_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688f16fprop_optimized_objs.dir/generated/conv2d/80/s1688f16fprop_optimized/cutlass_tensorop_s1688f16fprop_optimized_256x128_16x3_nhwc_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688dgrad_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688f16wgrad_optimized_objs.dir/generated/conv2d/80/s1688f16wgrad_optimized/all_sm80_s1688f16wgrad_optimized_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688dgrad_optimized_tf32_objs.dir/generated/conv2d/80/s1688dgrad_optimized_tf32/cutlass_tensorop_s1688dgrad_optimized_tf32_256x128_16x3_nhwc_align4.cu.o [ 57%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688f16wgrad_optimized_objs.dir/generated/conv2d/80/s1688f16wgrad_optimized/cutlass_tensorop_s1688f16wgrad_optimized_256x128_16x3_nhwc_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688f16dgrad_optimized_objs.dir/generated/conv2d/80/s1688f16dgrad_optimized/cutlass_tensorop_s1688f16dgrad_optimized_256x128_16x3_nhwc_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688f16fprop_optimized_objs.dir/generated/conv2d/80/s1688f16fprop_optimized/cutlass_tensorop_s1688f16fprop_optimized_256x128_16x3_nhwc_single_group_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688f16wgrad_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688fprop_optimized_objs.dir/generated/conv2d/80/s1688fprop_optimized/all_sm80_s1688fprop_optimized_conv2d_operations.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688dgrad_optimized_tf32_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688fprop_optimized_tf32_objs.dir/generated/conv2d/80/s1688fprop_optimized_tf32/all_sm80_s1688fprop_optimized_tf32_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688fprop_optimized_tf32_objs.dir/generated/conv2d/80/s1688fprop_optimized_tf32/cutlass_tensorop_s1688fprop_optimized_tf32_256x128_16x3_nhwc_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688fprop_optimized_objs.dir/generated/conv2d/80/s1688fprop_optimized/cutlass_tensorop_s1688fprop_optimized_128x128_16x4_nhwc_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688f16dgrad_optimized_objs [ 57%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688tf32dgrad_optimized_objs.dir/generated/conv2d/80/s1688tf32dgrad_optimized/all_sm80_s1688tf32dgrad_optimized_conv2d_operations.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688f16fprop_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688tf32fprop_optimized_objs.dir/generated/conv2d/80/s1688tf32fprop_optimized/all_sm80_s1688tf32fprop_optimized_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688tf32dgrad_optimized_objs.dir/generated/conv2d/80/s1688tf32dgrad_optimized/cutlass_tensorop_s1688tf32dgrad_optimized_256x128_16x3_nhwc_unity_stride_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688tf32fprop_optimized_objs.dir/generated/conv2d/80/s1688tf32fprop_optimized/cutlass_tensorop_s1688tf32fprop_optimized_256x128_16x3_nhwc_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688fprop_optimized_objs.dir/generated/conv2d/80/s1688fprop_optimized/cutlass_tensorop_s1688fprop_optimized_128x128_16x4_nhwc_single_group_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688fprop_optimized_tf32_objs.dir/generated/conv2d/80/s1688fprop_optimized_tf32/cutlass_tensorop_s1688fprop_optimized_tf32_256x128_16x3_nhwc_single_group_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688tf32dgrad_optimized_objs.dir/generated/conv2d/80/s1688tf32dgrad_optimized/cutlass_tensorop_s1688tf32dgrad_optimized_256x128_16x3_nhwc_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688tf32fprop_optimized_objs.dir/generated/conv2d/80/s1688tf32fprop_optimized/cutlass_tensorop_s1688tf32fprop_optimized_256x128_16x3_nhwc_single_group_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688fprop_optimized_objs [ 57%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688tf32wgrad_optimized_objs.dir/generated/conv2d/80/s1688tf32wgrad_optimized/all_sm80_s1688tf32wgrad_optimized_conv2d_operations.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688fprop_optimized_tf32_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688wgrad_optimized_objs.dir/generated/conv2d/80/s1688wgrad_optimized/all_sm80_s1688wgrad_optimized_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688tf32wgrad_optimized_objs.dir/generated/conv2d/80/s1688tf32wgrad_optimized/cutlass_tensorop_s1688tf32wgrad_optimized_256x128_16x3_nhwc_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688wgrad_optimized_objs.dir/generated/conv2d/80/s1688wgrad_optimized/cutlass_tensorop_s1688wgrad_optimized_128x128_16x4_nhwc_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688tf32fprop_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688wgrad_optimized_tf32_objs.dir/generated/conv2d/80/s1688wgrad_optimized_tf32/all_sm80_s1688wgrad_optimized_tf32_conv2d_operations.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688tf32dgrad_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s4_i16864fprop_optimized_s4_objs.dir/generated/conv2d/80/s4_i16864fprop_optimized_s4/all_sm80_s4_i16864fprop_optimized_s4_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688wgrad_optimized_tf32_objs.dir/generated/conv2d/80/s1688wgrad_optimized_tf32/cutlass_tensorop_s1688wgrad_optimized_tf32_256x128_16x3_nhwc_align4.cu.o [ 57%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s4_i16864fprop_optimized_s4_objs.dir/generated/conv2d/80/s4_i16864fprop_optimized_s4/cutlass_tensorop_s4_i16864fprop_optimized_s4_256x128_128x3_nhwc_align32.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688tf32wgrad_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s8_i16832fprop_few_channels_s8_objs.dir/generated/conv2d/80/s8_i16832fprop_few_channels_s8/all_sm80_s8_i16832fprop_few_channels_s8_conv2d_operations.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688wgrad_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s8_i16832fprop_fixed_channels_s8_objs.dir/generated/conv2d/80/s8_i16832fprop_fixed_channels_s8/all_sm80_s8_i16832fprop_fixed_channels_s8_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s8_i16832fprop_few_channels_s8_objs.dir/generated/conv2d/80/s8_i16832fprop_few_channels_s8/cutlass_tensorop_s8_i16832fprop_few_channels_s8_256x128_64x3_nhwc_align16.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s8_i16832fprop_fixed_channels_s8_objs.dir/generated/conv2d/80/s8_i16832fprop_fixed_channels_s8/cutlass_tensorop_s8_i16832fprop_fixed_channels_s8_256x128_64x3_nhwc_align16.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688wgrad_optimized_tf32_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s8_i16832fprop_optimized_s8_objs.dir/generated/conv2d/80/s8_i16832fprop_optimized_s8/all_sm80_s8_i16832fprop_optimized_s8_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s4_i16864fprop_optimized_s4_objs.dir/generated/conv2d/80/s4_i16864fprop_optimized_s4/cutlass_tensorop_s4_i16864fprop_optimized_s4_256x128_128x3_nhwc_single_group_align32.cu.o [ 57%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s8_i16832fprop_optimized_s8_objs.dir/generated/conv2d/80/s8_i16832fprop_optimized_s8/cutlass_tensorop_s8_i16832fprop_optimized_s8_256x128_64x3_nhwc_align16.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s8_i16832fprop_few_channels_s8_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_sdgrad_optimized_objs.dir/generated/conv2d/80/sdgrad_optimized/all_sm80_sdgrad_optimized_conv2d_operations.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s8_i16832fprop_fixed_channels_s8_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_sfprop_optimized_objs.dir/generated/conv2d/80/sfprop_optimized/all_sm80_sfprop_optimized_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s4_i16864fprop_optimized_s4_objs.dir/generated/conv2d/80/s4_i16864fprop_optimized_s4/cutlass_tensorop_s4_i16864fprop_optimized_s4_256x128_128x3_nc64hw64_align32.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_sdgrad_optimized_objs.dir/generated/conv2d/80/sdgrad_optimized/cutlass_simt_sdgrad_optimized_256x128_8x5_nhwc_unity_stride_align1.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_sfprop_optimized_objs.dir/generated/conv2d/80/sfprop_optimized/cutlass_simt_sfprop_optimized_256x128_8x5_nhwc_align1.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s8_i16832fprop_optimized_s8_objs.dir/generated/conv2d/80/s8_i16832fprop_optimized_s8/cutlass_tensorop_s8_i16832fprop_optimized_s8_256x128_64x3_nhwc_single_group_align16.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s4_i16864fprop_optimized_s4_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_swgrad_optimized_objs.dir/generated/conv2d/80/swgrad_optimized/all_sm80_swgrad_optimized_conv2d_operations.cu.o [ 57%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s8_i16832fprop_optimized_s8_objs.dir/generated/conv2d/80/s8_i16832fprop_optimized_s8/cutlass_tensorop_s8_i16832fprop_optimized_s8_256x128_64x3_nc32hw32_align16.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_swgrad_optimized_objs.dir/generated/conv2d/80/swgrad_optimized/cutlass_simt_swgrad_optimized_256x128_8x5_nhwc_align1.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_sdgrad_optimized_objs.dir/generated/conv2d/80/sdgrad_optimized/cutlass_simt_sdgrad_optimized_256x128_8x5_nhwc_align1.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_sfprop_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_tf32_s1688dgrad_optimized_tf32_objs.dir/generated/conv2d/80/tf32_s1688dgrad_optimized_tf32/all_sm80_tf32_s1688dgrad_optimized_tf32_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_tf32_s1688dgrad_optimized_tf32_objs.dir/generated/conv2d/80/tf32_s1688dgrad_optimized_tf32/cutlass_tensorop_tf32_s1688dgrad_optimized_tf32_256x128_16x3_nhwc_unity_stride_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s8_i16832fprop_optimized_s8_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_tf32_s1688fprop_optimized_tf32_objs.dir/generated/conv2d/80/tf32_s1688fprop_optimized_tf32/all_sm80_tf32_s1688fprop_optimized_tf32_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_tf32_s1688fprop_optimized_tf32_objs.dir/generated/conv2d/80/tf32_s1688fprop_optimized_tf32/cutlass_tensorop_tf32_s1688fprop_optimized_tf32_256x128_16x3_nhwc_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_swgrad_optimized_objs [ 57%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_tf32_s1688wgrad_optimized_tf32_objs.dir/generated/conv2d/80/tf32_s1688wgrad_optimized_tf32/all_sm80_tf32_s1688wgrad_optimized_tf32_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_tf32_s1688dgrad_optimized_tf32_objs.dir/generated/conv2d/80/tf32_s1688dgrad_optimized_tf32/cutlass_tensorop_tf32_s1688dgrad_optimized_tf32_256x128_16x3_nhwc_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_tf32_s1688wgrad_optimized_tf32_objs.dir/generated/conv2d/80/tf32_s1688wgrad_optimized_tf32/cutlass_tensorop_tf32_s1688wgrad_optimized_tf32_256x128_16x3_nhwc_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_sdgrad_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u4_i16864fprop_optimized_u4_objs.dir/generated/conv2d/80/u4_i16864fprop_optimized_u4/all_sm80_u4_i16864fprop_optimized_u4_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_tf32_s1688fprop_optimized_tf32_objs.dir/generated/conv2d/80/tf32_s1688fprop_optimized_tf32/cutlass_tensorop_tf32_s1688fprop_optimized_tf32_256x128_16x3_nhwc_single_group_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u4_i16864fprop_optimized_u4_objs.dir/generated/conv2d/80/u4_i16864fprop_optimized_u4/cutlass_tensorop_u4_i16864fprop_optimized_u4_256x128_128x3_nhwc_align32.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_tf32_s1688wgrad_optimized_tf32_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u8_i16832fprop_few_channels_u8_objs.dir/generated/conv2d/80/u8_i16832fprop_few_channels_u8/all_sm80_u8_i16832fprop_few_channels_u8_conv2d_operations.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_tf32_s1688dgrad_optimized_tf32_objs [ 57%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u8_i16832fprop_fixed_channels_u8_objs.dir/generated/conv2d/80/u8_i16832fprop_fixed_channels_u8/all_sm80_u8_i16832fprop_fixed_channels_u8_conv2d_operations.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u8_i16832fprop_few_channels_u8_objs.dir/generated/conv2d/80/u8_i16832fprop_few_channels_u8/cutlass_tensorop_u8_i16832fprop_few_channels_u8_256x128_64x3_nhwc_align16.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u8_i16832fprop_fixed_channels_u8_objs.dir/generated/conv2d/80/u8_i16832fprop_fixed_channels_u8/cutlass_tensorop_u8_i16832fprop_fixed_channels_u8_256x128_64x3_nhwc_align16.cu.o [ 58%] Built target cutlass_library_conv2d_sm80_tf32_s1688fprop_optimized_tf32_objs [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u8_i16832fprop_optimized_u8_objs.dir/generated/conv2d/80/u8_i16832fprop_optimized_u8/all_sm80_u8_i16832fprop_optimized_u8_conv2d_operations.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u4_i16864fprop_optimized_u4_objs.dir/generated/conv2d/80/u4_i16864fprop_optimized_u4/cutlass_tensorop_u4_i16864fprop_optimized_u4_256x128_128x3_nhwc_single_group_align32.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u8_i16832fprop_optimized_u8_objs.dir/generated/conv2d/80/u8_i16832fprop_optimized_u8/cutlass_tensorop_u8_i16832fprop_optimized_u8_256x128_64x3_nhwc_align16.cu.o [ 58%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_few_channels_u8_objs [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e4m3_objs.dir/generated/conv2d/89/s16832fprop_fixed_channels_e4m3/all_sm89_s16832fprop_fixed_channels_e4m3_conv2d_operations.cu.o [ 58%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_fixed_channels_u8_objs [ 58%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e5m2_objs.dir/generated/conv2d/89/s16832fprop_fixed_channels_e5m2/all_sm89_s16832fprop_fixed_channels_e5m2_conv2d_operations.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u4_i16864fprop_optimized_u4_objs.dir/generated/conv2d/80/u4_i16864fprop_optimized_u4/cutlass_tensorop_u4_i16864fprop_optimized_u4_256x128_128x3_nc64hw64_align32.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e4m3_objs.dir/generated/conv2d/89/s16832fprop_fixed_channels_e4m3/cutlass_tensorop_s16832fprop_fixed_channels_e4m3_256x128_64x3_nhwc_align16.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u8_i16832fprop_optimized_u8_objs.dir/generated/conv2d/80/u8_i16832fprop_optimized_u8/cutlass_tensorop_u8_i16832fprop_optimized_u8_256x128_64x3_nhwc_single_group_align16.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e5m2_objs.dir/generated/conv2d/89/s16832fprop_fixed_channels_e5m2/cutlass_tensorop_s16832fprop_fixed_channels_e5m2_256x128_64x3_nhwc_align16.cu.o [ 58%] Built target cutlass_library_conv2d_sm80_u4_i16864fprop_optimized_u4_objs [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_optimized_e4m3_objs.dir/generated/conv2d/89/s16832fprop_optimized_e4m3/all_sm89_s16832fprop_optimized_e4m3_conv2d_operations.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u8_i16832fprop_optimized_u8_objs.dir/generated/conv2d/80/u8_i16832fprop_optimized_u8/cutlass_tensorop_u8_i16832fprop_optimized_u8_256x128_64x3_nc32hw32_align16.cu.o [ 58%] Built target cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e4m3_objs [ 58%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_optimized_e5m2_objs.dir/generated/conv2d/89/s16832fprop_optimized_e5m2/all_sm89_s16832fprop_optimized_e5m2_conv2d_operations.cu.o [ 58%] Built target cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e5m2_objs [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_objs.dir/generated/conv2d/90/h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16/all_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_conv2d_operations.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_optimized_e4m3_objs.dir/generated/conv2d/89/s16832fprop_optimized_e4m3/cutlass_tensorop_s16832fprop_optimized_e4m3_256x128_64x3_nhwc_align16.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_optimized_e5m2_objs.dir/generated/conv2d/89/s16832fprop_optimized_e5m2/cutlass_tensorop_s16832fprop_optimized_e5m2_256x128_64x3_nhwc_align16.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_objs.dir/generated/conv2d/90/h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16/cutlass3x_sm90_tensorop_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 58%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_optimized_u8_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_objs.dir/generated/conv2d/90/h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16/all_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_conv2d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_optimized_e4m3_objs.dir/generated/conv2d/89/s16832fprop_optimized_e4m3/cutlass_tensorop_s16832fprop_optimized_e4m3_256x128_64x3_nhwc_single_group_align16.cu.o [ 59%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_objs.dir/generated/conv2d/90/h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16/cutlass3x_sm90_tensorop_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_optimized_e5m2_objs.dir/generated/conv2d/89/s16832fprop_optimized_e5m2/cutlass_tensorop_s16832fprop_optimized_e5m2_256x128_64x3_nhwc_single_group_align16.cu.o [ 59%] Built target cutlass_library_conv2d_sm89_s16832fprop_optimized_e4m3_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32/all_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_conv2d_operations.cu.o [ 59%] Built target cutlass_library_conv2d_sm89_s16832fprop_optimized_e5m2_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32/all_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_conv2d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In 
instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, 
void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = 
cutlass3x_sm90_tensorop_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; 
cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, 
cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_objs.dir/generated/conv2d/90/h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16/cutlass3x_sm90_tensorop_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In 
instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static 
cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = 
cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = 
cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = 
cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t 
CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, 
void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required 
from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_objs.dir/generated/conv2d/90/h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16/cutlass3x_sm90_tensorop_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = 
cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t 
CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32/all_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_conv2d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, 
cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, 
cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, 
cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32/all_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_conv2d_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = 
cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, 
cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, 
cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, 
CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, 
cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, 
cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct 
cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, 
ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, 
cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, 
cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > 
>, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16_objs.dir/generated/conv3d/80/bf16_s16816dgrad3d_analytic_bf16/all_sm80_bf16_s16816dgrad3d_analytic_bf16_conv3d_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, 
CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, 
cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, 
cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, 
cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16_objs.dir/generated/conv3d/80/bf16_s16816dgrad3d_analytic_bf16/cutlass_tensorop_bf16_s16816dgrad3d_analytic_bf16_256x128_32x3.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16_objs.dir/generated/conv3d/80/bf16_s16816dgrad3d_optimized_bf16/all_sm80_bf16_s16816dgrad3d_optimized_bf16_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16_objs.dir/generated/conv3d/80/bf16_s16816dgrad3d_optimized_bf16/cutlass_tensorop_bf16_s16816dgrad3d_optimized_bf16_256x128_32x3_unity_stride.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16_objs [ 59%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16_objs.dir/generated/conv3d/80/bf16_s16816fprop3d_optimized_bf16/all_sm80_bf16_s16816fprop3d_optimized_bf16_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16_objs.dir/generated/conv3d/80/bf16_s16816fprop3d_optimized_bf16/cutlass_tensorop_bf16_s16816fprop3d_optimized_bf16_256x128_32x3.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16_objs.dir/generated/conv3d/80/bf16_s16816wgrad3d_optimized_bf16/all_sm80_bf16_s16816wgrad3d_optimized_bf16_conv3d_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; 
StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, 
cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple 
> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, 
cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_f16_s16816dgrad3d_analytic_f16_objs.dir/generated/conv3d/80/f16_s16816dgrad3d_analytic_f16/all_sm80_f16_s16816dgrad3d_analytic_f16_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16_objs.dir/generated/conv3d/80/bf16_s16816wgrad3d_optimized_bf16/cutlass_tensorop_bf16_s16816wgrad3d_optimized_bf16_256x128_32x3.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_f16_s16816dgrad3d_analytic_f16_objs.dir/generated/conv3d/80/f16_s16816dgrad3d_analytic_f16/cutlass_tensorop_f16_s16816dgrad3d_analytic_f16_256x128_32x3.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, 
cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, 
cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > 
>, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_f16_s16816dgrad3d_optimized_f16_objs.dir/generated/conv3d/80/f16_s16816dgrad3d_optimized_f16/all_sm80_f16_s16816dgrad3d_optimized_f16_conv3d_operations.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_f16_s16816fprop3d_optimized_f16_objs.dir/generated/conv3d/80/f16_s16816fprop3d_optimized_f16/all_sm80_f16_s16816fprop3d_optimized_f16_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_f16_s16816dgrad3d_optimized_f16_objs.dir/generated/conv3d/80/f16_s16816dgrad3d_optimized_f16/cutlass_tensorop_f16_s16816dgrad3d_optimized_f16_256x128_32x3_unity_stride.cu.o [ 59%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv3d_sm80_f16_s16816fprop3d_optimized_f16_objs.dir/generated/conv3d/80/f16_s16816fprop3d_optimized_f16/cutlass_tensorop_f16_s16816fprop3d_optimized_f16_256x128_32x3.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_f16_s16816wgrad3d_optimized_f16_objs.dir/generated/conv3d/80/f16_s16816wgrad3d_optimized_f16/all_sm80_f16_s16816wgrad3d_optimized_f16_conv3d_operations.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_f16_s16816dgrad3d_analytic_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_h16816dgrad3d_analytic_objs.dir/generated/conv3d/80/h16816dgrad3d_analytic/all_sm80_h16816dgrad3d_analytic_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_f16_s16816wgrad3d_optimized_f16_objs.dir/generated/conv3d/80/f16_s16816wgrad3d_optimized_f16/cutlass_tensorop_f16_s16816wgrad3d_optimized_f16_256x128_32x3.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_h16816dgrad3d_analytic_objs.dir/generated/conv3d/80/h16816dgrad3d_analytic/cutlass_tensorop_h16816dgrad3d_analytic_256x128_32x3.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_f16_s16816dgrad3d_optimized_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_h16816dgrad3d_optimized_objs.dir/generated/conv3d/80/h16816dgrad3d_optimized/all_sm80_h16816dgrad3d_optimized_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_h16816dgrad3d_optimized_objs.dir/generated/conv3d/80/h16816dgrad3d_optimized/cutlass_tensorop_h16816dgrad3d_optimized_256x128_32x3_unity_stride.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_f16_s16816fprop3d_optimized_f16_objs [ 59%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv3d_sm80_h16816fprop3d_optimized_objs.dir/generated/conv3d/80/h16816fprop3d_optimized/all_sm80_h16816fprop3d_optimized_conv3d_operations.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_f16_s16816wgrad3d_optimized_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_h16816wgrad3d_optimized_objs.dir/generated/conv3d/80/h16816wgrad3d_optimized/all_sm80_h16816wgrad3d_optimized_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_h16816fprop3d_optimized_objs.dir/generated/conv3d/80/h16816fprop3d_optimized/cutlass_tensorop_h16816fprop3d_optimized_256x128_32x3.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_h16816dgrad3d_analytic_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_bf16_objs.dir/generated/conv3d/80/s16816dgrad3d_analytic_bf16/all_sm80_s16816dgrad3d_analytic_bf16_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_h16816wgrad3d_optimized_objs.dir/generated/conv3d/80/h16816wgrad3d_optimized/cutlass_tensorop_h16816wgrad3d_optimized_256x128_32x3.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_h16816dgrad3d_optimized_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_f16_objs.dir/generated/conv3d/80/s16816dgrad3d_analytic_f16/all_sm80_s16816dgrad3d_analytic_f16_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_bf16_objs.dir/generated/conv3d/80/s16816dgrad3d_analytic_bf16/cutlass_tensorop_s16816dgrad3d_analytic_bf16_256x128_32x3.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_f16_objs.dir/generated/conv3d/80/s16816dgrad3d_analytic_f16/cutlass_tensorop_s16816dgrad3d_analytic_f16_256x128_32x3.cu.o [ 59%] Built target 
cutlass_library_conv3d_sm80_h16816fprop3d_optimized_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_bf16_objs.dir/generated/conv3d/80/s16816dgrad3d_optimized_bf16/all_sm80_s16816dgrad3d_optimized_bf16_conv3d_operations.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_h16816wgrad3d_optimized_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_f16_objs.dir/generated/conv3d/80/s16816dgrad3d_optimized_f16/all_sm80_s16816dgrad3d_optimized_f16_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_bf16_objs.dir/generated/conv3d/80/s16816dgrad3d_optimized_bf16/cutlass_tensorop_s16816dgrad3d_optimized_bf16_256x128_32x3_unity_stride.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_f16_objs.dir/generated/conv3d/80/s16816dgrad3d_optimized_f16/cutlass_tensorop_s16816dgrad3d_optimized_f16_256x128_32x3_unity_stride.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_bf16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816fprop3d_optimized_bf16_objs.dir/generated/conv3d/80/s16816fprop3d_optimized_bf16/all_sm80_s16816fprop3d_optimized_bf16_conv3d_operations.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816fprop3d_optimized_f16_objs.dir/generated/conv3d/80/s16816fprop3d_optimized_f16/all_sm80_s16816fprop3d_optimized_f16_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816fprop3d_optimized_bf16_objs.dir/generated/conv3d/80/s16816fprop3d_optimized_bf16/cutlass_tensorop_s16816fprop3d_optimized_bf16_256x128_32x3.cu.o [ 59%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816fprop3d_optimized_f16_objs.dir/generated/conv3d/80/s16816fprop3d_optimized_f16/cutlass_tensorop_s16816fprop3d_optimized_f16_256x128_32x3.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_bf16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_bf16_objs.dir/generated/conv3d/80/s16816wgrad3d_optimized_bf16/all_sm80_s16816wgrad3d_optimized_bf16_conv3d_operations.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_f16_objs.dir/generated/conv3d/80/s16816wgrad3d_optimized_f16/all_sm80_s16816wgrad3d_optimized_f16_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_bf16_objs.dir/generated/conv3d/80/s16816wgrad3d_optimized_bf16/cutlass_tensorop_s16816wgrad3d_optimized_bf16_256x128_32x3.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_f16_objs.dir/generated/conv3d/80/s16816wgrad3d_optimized_f16/cutlass_tensorop_s16816wgrad3d_optimized_f16_256x128_32x3.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_s16816fprop3d_optimized_bf16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_objs.dir/generated/conv3d/90/h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16/all_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_conv3d_operations.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_s16816fprop3d_optimized_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_objs.dir/generated/conv3d/90/h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16/all_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_conv3d_operations.cu.o [ 
59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_objs.dir/generated/conv3d/90/h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16/cutlass3x_sm90_tensorop_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_bf16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32/all_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_objs.dir/generated/conv3d/90/h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16/cutlass3x_sm90_tensorop_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32/all_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 59%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = 
cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, 
cute::SM90_U32x4_STSM_N>; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status 
cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided 
default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = 
cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required 
from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, 
void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque 
[16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_objs.dir/generated/conv3d/90/h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16/cutlass3x_sm90_tensorop_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, 
cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, 
cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, 
cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, 
cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, 
cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, 
cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, 
cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_objs [ 59%] Built target cutlass_library_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32/all_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In 
instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static 
cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = 
cutlass3x_sm90_tensorop_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = 
cutlass3x_sm90_tensorop_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, 
void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = 
cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = 
cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque 
[16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32/all_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_conv3d_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ 
= cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, 
cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 
3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_objs.dir/generated/conv3d/90/h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16/cutlass3x_sm90_tensorop_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > 
>, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688herk_objs.dir/generated/rank_k/80/c1688herk/all_sm80_c1688herk_rank_k_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688herk_objs.dir/generated/rank_k/80/c1688herk/cutlass_tensorop_c1688herk_128x64_16x4_n_l_align1.cu.o [ 59%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688herk_objs.dir/generated/rank_k/80/c1688herk/cutlass_tensorop_c1688herk_128x64_16x4_n_u_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' 
/builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: 
required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const 
[with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: 
and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688herk_objs.dir/generated/rank_k/80/c1688herk/cutlass_tensorop_c1688herk_128x64_16x4_h_l_align1.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688herk_objs.dir/generated/rank_k/80/c1688herk/cutlass_tensorop_c1688herk_128x64_16x4_h_u_align1.cu.o [ 59%] Built target cutlass_library_rank_k_sm80_c1688herk_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688syrk_objs.dir/generated/rank_k/80/c1688syrk/all_sm80_c1688syrk_rank_k_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688syrk_objs.dir/generated/rank_k/80/c1688syrk/cutlass_tensorop_c1688syrk_128x64_16x4_n_l_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, 
StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, 
cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, 
cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32herk_objs.dir/generated/rank_k/80/c1688tf32herk/all_sm80_c1688tf32herk_rank_k_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32herk_objs.dir/generated/rank_k/80/c1688tf32herk/cutlass_tensorop_c1688tf32herk_128x64_16x4_n_l_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, 
EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, 
void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; 
cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, 
cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, 
cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, 
ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const 
cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = 
cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, 
cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, 
cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32syrk_objs.dir/generated/rank_k/80/c1688tf32syrk/all_sm80_c1688tf32syrk_rank_k_operations.cu.o [ 59%] Built target 
cutlass_library_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_d884syrk_objs.dir/generated/rank_k/80/d884syrk/all_sm80_d884syrk_rank_k_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32syrk_objs.dir/generated/rank_k/80/c1688tf32syrk/cutlass_tensorop_c1688tf32syrk_128x64_16x4_n_l_align1.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_d884syrk_objs.dir/generated/rank_k/80/d884syrk/cutlass_tensorop_d884syrk_128x128_16x3_n_l_align1.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688syrk_objs.dir/generated/rank_k/80/c1688syrk/cutlass_tensorop_c1688syrk_128x64_16x4_n_u_align1.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32herk_objs.dir/generated/rank_k/80/c1688tf32herk/cutlass_tensorop_c1688tf32herk_128x64_16x4_n_u_align1.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32syrk_objs.dir/generated/rank_k/80/c1688tf32syrk/cutlass_tensorop_c1688tf32syrk_128x64_16x4_n_u_align1.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_d884syrk_objs.dir/generated/rank_k/80/d884syrk/cutlass_tensorop_d884syrk_128x128_16x3_n_u_align1.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688syrk_objs.dir/generated/rank_k/80/c1688syrk/cutlass_tensorop_c1688syrk_128x64_16x4_t_l_align1.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32herk_objs.dir/generated/rank_k/80/c1688tf32herk/cutlass_tensorop_c1688tf32herk_128x64_16x4_h_l_align1.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32syrk_objs.dir/generated/rank_k/80/c1688tf32syrk/cutlass_tensorop_c1688tf32syrk_128x64_16x4_t_l_align1.cu.o [ 59%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_k_sm80_d884syrk_objs.dir/generated/rank_k/80/d884syrk/cutlass_tensorop_d884syrk_128x128_16x3_t_l_align1.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688syrk_objs.dir/generated/rank_k/80/c1688syrk/cutlass_tensorop_c1688syrk_128x64_16x4_t_u_align1.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32herk_objs.dir/generated/rank_k/80/c1688tf32herk/cutlass_tensorop_c1688tf32herk_128x64_16x4_h_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32syrk_objs.dir/generated/rank_k/80/c1688tf32syrk/cutlass_tensorop_c1688tf32syrk_128x64_16x4_t_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_d884syrk_objs.dir/generated/rank_k/80/d884syrk/cutlass_tensorop_d884syrk_128x128_16x3_t_u_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm80_c1688syrk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884herk_objs.dir/generated/rank_k/80/gz884herk/all_sm80_gz884herk_rank_k_operations.cu.o [ 60%] Built target cutlass_library_rank_k_sm80_c1688tf32herk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884syrk_objs.dir/generated/rank_k/80/gz884syrk/all_sm80_gz884syrk_rank_k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884herk_objs.dir/generated/rank_k/80/gz884herk/cutlass_tensorop_gz884herk_64x64_8x3_n_l_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm80_c1688tf32syrk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688syrk_objs.dir/generated/rank_k/80/s1688syrk/all_sm80_s1688syrk_rank_k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884syrk_objs.dir/generated/rank_k/80/gz884syrk/cutlass_tensorop_gz884syrk_64x64_8x3_n_l_align1.cu.o [ 
60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688syrk_objs.dir/generated/rank_k/80/s1688syrk/cutlass_tensorop_s1688syrk_256x128_16x3_n_l_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm80_d884syrk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688tf32syrk_objs.dir/generated/rank_k/80/s1688tf32syrk/all_sm80_s1688tf32syrk_rank_k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884herk_objs.dir/generated/rank_k/80/gz884herk/cutlass_tensorop_gz884herk_64x64_8x3_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884syrk_objs.dir/generated/rank_k/80/gz884syrk/cutlass_tensorop_gz884syrk_64x64_8x3_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688tf32syrk_objs.dir/generated/rank_k/80/s1688tf32syrk/cutlass_tensorop_s1688tf32syrk_256x128_16x3_n_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884syrk_objs.dir/generated/rank_k/80/gz884syrk/cutlass_tensorop_gz884syrk_64x64_8x3_t_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884herk_objs.dir/generated/rank_k/80/gz884herk/cutlass_tensorop_gz884herk_64x64_8x3_h_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884syrk_objs.dir/generated/rank_k/80/gz884syrk/cutlass_tensorop_gz884syrk_64x64_8x3_t_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688syrk_objs.dir/generated/rank_k/80/s1688syrk/cutlass_tensorop_s1688syrk_256x128_16x3_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884herk_objs.dir/generated/rank_k/80/gz884herk/cutlass_tensorop_gz884herk_64x64_8x3_h_u_align1.cu.o [ 60%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688tf32syrk_objs.dir/generated/rank_k/80/s1688tf32syrk/cutlass_tensorop_s1688tf32syrk_256x128_16x3_n_u_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm80_gz884syrk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884herk_objs.dir/generated/rank_k/80/z884herk/all_sm80_z884herk_rank_k_operations.cu.o [ 60%] Built target cutlass_library_rank_k_sm80_gz884herk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884syrk_objs.dir/generated/rank_k/80/z884syrk/all_sm80_z884syrk_rank_k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884herk_objs.dir/generated/rank_k/80/z884herk/cutlass_tensorop_z884herk_128x64_8x3_n_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884syrk_objs.dir/generated/rank_k/80/z884syrk/cutlass_tensorop_z884syrk_128x64_8x3_n_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688syrk_objs.dir/generated/rank_k/80/s1688syrk/cutlass_tensorop_s1688syrk_256x128_16x3_t_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884syrk_objs.dir/generated/rank_k/80/z884syrk/cutlass_tensorop_z884syrk_128x64_8x3_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884herk_objs.dir/generated/rank_k/80/z884herk/cutlass_tensorop_z884herk_128x64_8x3_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688tf32syrk_objs.dir/generated/rank_k/80/s1688tf32syrk/cutlass_tensorop_s1688tf32syrk_256x128_16x3_t_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884syrk_objs.dir/generated/rank_k/80/z884syrk/cutlass_tensorop_z884syrk_128x64_8x3_t_l_align1.cu.o [ 60%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884herk_objs.dir/generated/rank_k/80/z884herk/cutlass_tensorop_z884herk_128x64_8x3_h_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688tf32syrk_objs.dir/generated/rank_k/80/s1688tf32syrk/cutlass_tensorop_s1688tf32syrk_256x128_16x3_t_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884syrk_objs.dir/generated/rank_k/80/z884syrk/cutlass_tensorop_z884syrk_128x64_8x3_t_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688syrk_objs.dir/generated/rank_k/80/s1688syrk/cutlass_tensorop_s1688syrk_256x128_16x3_t_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884herk_objs.dir/generated/rank_k/80/z884herk/cutlass_tensorop_z884herk_128x64_8x3_h_u_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm80_z884syrk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_d1684syrk_objs.dir/generated/rank_k/90/d1684syrk/all_sm90_d1684syrk_rank_k_operations.cu.o [ 60%] Built target cutlass_library_rank_k_sm80_z884herk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684herk_objs.dir/generated/rank_k/90/gz1684herk/all_sm90_gz1684herk_rank_k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_d1684syrk_objs.dir/generated/rank_k/90/d1684syrk/cutlass_tensorop_d1684syrk_128x128x16_1x1x1_3_n_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684herk_objs.dir/generated/rank_k/90/gz1684herk/cutlass_tensorop_gz1684herk_64x64x8_1x1x1_3_n_l_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm80_s1688tf32syrk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684syrk_objs.dir/generated/rank_k/90/gz1684syrk/all_sm90_gz1684syrk_rank_k_operations.cu.o [ 
60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684syrk_objs.dir/generated/rank_k/90/gz1684syrk/cutlass_tensorop_gz1684syrk_64x64x8_1x1x1_3_n_l_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm80_s1688syrk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684herk_objs.dir/generated/rank_k/90/z1684herk/all_sm90_z1684herk_rank_k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684herk_objs.dir/generated/rank_k/90/gz1684herk/cutlass_tensorop_gz1684herk_64x64x8_1x1x1_3_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684herk_objs.dir/generated/rank_k/90/z1684herk/cutlass_tensorop_z1684herk_128x64x8_1x1x1_3_n_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_d1684syrk_objs.dir/generated/rank_k/90/d1684syrk/cutlass_tensorop_d1684syrk_128x128x16_1x1x1_3_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684syrk_objs.dir/generated/rank_k/90/gz1684syrk/cutlass_tensorop_gz1684syrk_64x64x8_1x1x1_3_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684herk_objs.dir/generated/rank_k/90/gz1684herk/cutlass_tensorop_gz1684herk_64x64x8_1x1x1_3_h_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684herk_objs.dir/generated/rank_k/90/z1684herk/cutlass_tensorop_z1684herk_128x64x8_1x1x1_3_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_d1684syrk_objs.dir/generated/rank_k/90/d1684syrk/cutlass_tensorop_d1684syrk_128x128x16_1x1x1_3_t_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684syrk_objs.dir/generated/rank_k/90/gz1684syrk/cutlass_tensorop_gz1684syrk_64x64x8_1x1x1_3_t_l_align1.cu.o [ 60%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684herk_objs.dir/generated/rank_k/90/gz1684herk/cutlass_tensorop_gz1684herk_64x64x8_1x1x1_3_h_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684herk_objs.dir/generated/rank_k/90/z1684herk/cutlass_tensorop_z1684herk_128x64x8_1x1x1_3_h_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684syrk_objs.dir/generated/rank_k/90/gz1684syrk/cutlass_tensorop_gz1684syrk_64x64x8_1x1x1_3_t_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_d1684syrk_objs.dir/generated/rank_k/90/d1684syrk/cutlass_tensorop_d1684syrk_128x128x16_1x1x1_3_t_u_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm90_gz1684herk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684syrk_objs.dir/generated/rank_k/90/z1684syrk/all_sm90_z1684syrk_rank_k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684syrk_objs.dir/generated/rank_k/90/z1684syrk/cutlass_tensorop_z1684syrk_128x64x8_1x1x1_3_n_l_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm90_gz1684syrk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688her2k_objs.dir/generated/rank_2k/80/c1688her2k/all_sm80_c1688her2k_rank_2k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684herk_objs.dir/generated/rank_k/90/z1684herk/cutlass_tensorop_z1684herk_128x64x8_1x1x1_3_h_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688her2k_objs.dir/generated/rank_2k/80/c1688her2k/cutlass_tensorop_c1688her2k_128x64_16x4_n_l_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm90_d1684syrk_objs [ 60%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688syr2k_objs.dir/generated/rank_2k/80/c1688syr2k/all_sm80_c1688syr2k_rank_2k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684syrk_objs.dir/generated/rank_k/90/z1684syrk/cutlass_tensorop_z1684syrk_128x64x8_1x1x1_3_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688syr2k_objs.dir/generated/rank_2k/80/c1688syr2k/cutlass_tensorop_c1688syr2k_128x64_16x4_n_l_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm90_z1684herk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32her2k_objs.dir/generated/rank_2k/80/c1688tf32her2k/all_sm80_c1688tf32her2k_rank_2k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32her2k_objs.dir/generated/rank_2k/80/c1688tf32her2k/cutlass_tensorop_c1688tf32her2k_128x64_16x4_n_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684syrk_objs.dir/generated/rank_k/90/z1684syrk/cutlass_tensorop_z1684syrk_128x64x8_1x1x1_3_t_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688her2k_objs.dir/generated/rank_2k/80/c1688her2k/cutlass_tensorop_c1688her2k_128x64_16x4_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688syr2k_objs.dir/generated/rank_2k/80/c1688syr2k/cutlass_tensorop_c1688syr2k_128x64_16x4_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684syrk_objs.dir/generated/rank_k/90/z1684syrk/cutlass_tensorop_z1684syrk_128x64x8_1x1x1_3_t_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32her2k_objs.dir/generated/rank_2k/80/c1688tf32her2k/cutlass_tensorop_c1688tf32her2k_128x64_16x4_n_u_align1.cu.o [ 60%] Built target 
cutlass_library_rank_k_sm90_z1684syrk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32syr2k_objs.dir/generated/rank_2k/80/c1688tf32syr2k/all_sm80_c1688tf32syr2k_rank_2k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32syr2k_objs.dir/generated/rank_2k/80/c1688tf32syr2k/cutlass_tensorop_c1688tf32syr2k_128x64_16x4_n_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688syr2k_objs.dir/generated/rank_2k/80/c1688syr2k/cutlass_tensorop_c1688syr2k_128x64_16x4_t_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688her2k_objs.dir/generated/rank_2k/80/c1688her2k/cutlass_tensorop_c1688her2k_128x64_16x4_h_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32her2k_objs.dir/generated/rank_2k/80/c1688tf32her2k/cutlass_tensorop_c1688tf32her2k_128x64_16x4_h_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32syr2k_objs.dir/generated/rank_2k/80/c1688tf32syr2k/cutlass_tensorop_c1688tf32syr2k_128x64_16x4_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688syr2k_objs.dir/generated/rank_2k/80/c1688syr2k/cutlass_tensorop_c1688syr2k_128x64_16x4_t_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32her2k_objs.dir/generated/rank_2k/80/c1688tf32her2k/cutlass_tensorop_c1688tf32her2k_128x64_16x4_h_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688her2k_objs.dir/generated/rank_2k/80/c1688her2k/cutlass_tensorop_c1688her2k_128x64_16x4_h_u_align1.cu.o [ 60%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32syr2k_objs.dir/generated/rank_2k/80/c1688tf32syr2k/cutlass_tensorop_c1688tf32syr2k_128x64_16x4_t_l_align1.cu.o [ 60%] Built target cutlass_library_rank_2k_sm80_c1688syr2k_objs [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_d884syr2k_objs.dir/generated/rank_2k/80/d884syr2k/all_sm80_d884syr2k_rank_2k_operations.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_d884syr2k_objs.dir/generated/rank_2k/80/d884syr2k/cutlass_tensorop_d884syr2k_128x128_16x3_n_l_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32syr2k_objs.dir/generated/rank_2k/80/c1688tf32syr2k/cutlass_tensorop_c1688tf32syr2k_128x64_16x4_t_u_align1.cu.o [ 61%] Built target cutlass_library_rank_2k_sm80_c1688tf32her2k_objs [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884her2k_objs.dir/generated/rank_2k/80/gz884her2k/all_sm80_gz884her2k_rank_2k_operations.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884her2k_objs.dir/generated/rank_2k/80/gz884her2k/cutlass_tensorop_gz884her2k_64x64_8x3_n_l_align1.cu.o [ 61%] Built target cutlass_library_rank_2k_sm80_c1688her2k_objs [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884syr2k_objs.dir/generated/rank_2k/80/gz884syr2k/all_sm80_gz884syr2k_rank_2k_operations.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884syr2k_objs.dir/generated/rank_2k/80/gz884syr2k/cutlass_tensorop_gz884syr2k_64x64_8x3_n_l_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_d884syr2k_objs.dir/generated/rank_2k/80/d884syr2k/cutlass_tensorop_d884syr2k_128x128_16x3_n_u_align1.cu.o [ 61%] Built target cutlass_library_rank_2k_sm80_c1688tf32syr2k_objs [ 61%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688syr2k_objs.dir/generated/rank_2k/80/s1688syr2k/all_sm80_s1688syr2k_rank_2k_operations.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884her2k_objs.dir/generated/rank_2k/80/gz884her2k/cutlass_tensorop_gz884her2k_64x64_8x3_n_u_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688syr2k_objs.dir/generated/rank_2k/80/s1688syr2k/cutlass_tensorop_s1688syr2k_256x128_16x3_n_l_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884syr2k_objs.dir/generated/rank_2k/80/gz884syr2k/cutlass_tensorop_gz884syr2k_64x64_8x3_n_u_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_d884syr2k_objs.dir/generated/rank_2k/80/d884syr2k/cutlass_tensorop_d884syr2k_128x128_16x3_t_l_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884her2k_objs.dir/generated/rank_2k/80/gz884her2k/cutlass_tensorop_gz884her2k_64x64_8x3_h_l_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884syr2k_objs.dir/generated/rank_2k/80/gz884syr2k/cutlass_tensorop_gz884syr2k_64x64_8x3_t_l_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884her2k_objs.dir/generated/rank_2k/80/gz884her2k/cutlass_tensorop_gz884her2k_64x64_8x3_h_u_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884syr2k_objs.dir/generated/rank_2k/80/gz884syr2k/cutlass_tensorop_gz884syr2k_64x64_8x3_t_u_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_d884syr2k_objs.dir/generated/rank_2k/80/d884syr2k/cutlass_tensorop_d884syr2k_128x128_16x3_t_u_align1.cu.o [ 61%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688syr2k_objs.dir/generated/rank_2k/80/s1688syr2k/cutlass_tensorop_s1688syr2k_256x128_16x3_n_u_align1.cu.o [ 61%] Built target cutlass_library_rank_2k_sm80_gz884her2k_objs [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688tf32syr2k_objs.dir/generated/rank_2k/80/s1688tf32syr2k/all_sm80_s1688tf32syr2k_rank_2k_operations.cu.o [ 61%] Built target cutlass_library_rank_2k_sm80_gz884syr2k_objs [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884her2k_objs.dir/generated/rank_2k/80/z884her2k/all_sm80_z884her2k_rank_2k_operations.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688tf32syr2k_objs.dir/generated/rank_2k/80/s1688tf32syr2k/cutlass_tensorop_s1688tf32syr2k_256x128_16x3_n_l_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884her2k_objs.dir/generated/rank_2k/80/z884her2k/cutlass_tensorop_z884her2k_128x64_8x3_n_l_align1.cu.o [ 62%] Built target cutlass_library_rank_2k_sm80_d884syr2k_objs [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884syr2k_objs.dir/generated/rank_2k/80/z884syr2k/all_sm80_z884syr2k_rank_2k_operations.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884syr2k_objs.dir/generated/rank_2k/80/z884syr2k/cutlass_tensorop_z884syr2k_128x64_8x3_n_l_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688syr2k_objs.dir/generated/rank_2k/80/s1688syr2k/cutlass_tensorop_s1688syr2k_256x128_16x3_t_l_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884her2k_objs.dir/generated/rank_2k/80/z884her2k/cutlass_tensorop_z884her2k_128x64_8x3_n_u_align1.cu.o [ 62%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884syr2k_objs.dir/generated/rank_2k/80/z884syr2k/cutlass_tensorop_z884syr2k_128x64_8x3_n_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688tf32syr2k_objs.dir/generated/rank_2k/80/s1688tf32syr2k/cutlass_tensorop_s1688tf32syr2k_256x128_16x3_n_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884syr2k_objs.dir/generated/rank_2k/80/z884syr2k/cutlass_tensorop_z884syr2k_128x64_8x3_t_l_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884her2k_objs.dir/generated/rank_2k/80/z884her2k/cutlass_tensorop_z884her2k_128x64_8x3_h_l_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884syr2k_objs.dir/generated/rank_2k/80/z884syr2k/cutlass_tensorop_z884syr2k_128x64_8x3_t_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688syr2k_objs.dir/generated/rank_2k/80/s1688syr2k/cutlass_tensorop_s1688syr2k_256x128_16x3_t_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884her2k_objs.dir/generated/rank_2k/80/z884her2k/cutlass_tensorop_z884her2k_128x64_8x3_h_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688tf32syr2k_objs.dir/generated/rank_2k/80/s1688tf32syr2k/cutlass_tensorop_s1688tf32syr2k_256x128_16x3_t_l_align1.cu.o [ 62%] Built target cutlass_library_rank_2k_sm80_z884syr2k_objs [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_d1684syr2k_objs.dir/generated/rank_2k/90/d1684syr2k/all_sm90_d1684syr2k_rank_2k_operations.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_d1684syr2k_objs.dir/generated/rank_2k/90/d1684syr2k/cutlass_tensorop_d1684syr2k_128x128x16_1x1x1_3_n_l_align1.cu.o [ 62%] Built target 
cutlass_library_rank_2k_sm80_z884her2k_objs [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684her2k_objs.dir/generated/rank_2k/90/gz1684her2k/all_sm90_gz1684her2k_rank_2k_operations.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684her2k_objs.dir/generated/rank_2k/90/gz1684her2k/cutlass_tensorop_gz1684her2k_64x64x8_1x1x1_3_n_l_align1.cu.o [ 62%] Built target cutlass_library_rank_2k_sm80_s1688syr2k_objs [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684syr2k_objs.dir/generated/rank_2k/90/gz1684syr2k/all_sm90_gz1684syr2k_rank_2k_operations.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_d1684syr2k_objs.dir/generated/rank_2k/90/d1684syr2k/cutlass_tensorop_d1684syr2k_128x128x16_1x1x1_3_n_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684syr2k_objs.dir/generated/rank_2k/90/gz1684syr2k/cutlass_tensorop_gz1684syr2k_64x64x8_1x1x1_3_n_l_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688tf32syr2k_objs.dir/generated/rank_2k/80/s1688tf32syr2k/cutlass_tensorop_s1688tf32syr2k_256x128_16x3_t_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684her2k_objs.dir/generated/rank_2k/90/gz1684her2k/cutlass_tensorop_gz1684her2k_64x64x8_1x1x1_3_n_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684syr2k_objs.dir/generated/rank_2k/90/gz1684syr2k/cutlass_tensorop_gz1684syr2k_64x64x8_1x1x1_3_n_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_d1684syr2k_objs.dir/generated/rank_2k/90/d1684syr2k/cutlass_tensorop_d1684syr2k_128x128x16_1x1x1_3_t_l_align1.cu.o [ 62%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684her2k_objs.dir/generated/rank_2k/90/gz1684her2k/cutlass_tensorop_gz1684her2k_64x64x8_1x1x1_3_h_l_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684syr2k_objs.dir/generated/rank_2k/90/gz1684syr2k/cutlass_tensorop_gz1684syr2k_64x64x8_1x1x1_3_t_l_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_d1684syr2k_objs.dir/generated/rank_2k/90/d1684syr2k/cutlass_tensorop_d1684syr2k_128x128x16_1x1x1_3_t_u_align1.cu.o [ 62%] Built target cutlass_library_rank_2k_sm80_s1688tf32syr2k_objs [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684her2k_objs.dir/generated/rank_2k/90/z1684her2k/all_sm90_z1684her2k_rank_2k_operations.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684her2k_objs.dir/generated/rank_2k/90/gz1684her2k/cutlass_tensorop_gz1684her2k_64x64x8_1x1x1_3_h_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684syr2k_objs.dir/generated/rank_2k/90/gz1684syr2k/cutlass_tensorop_gz1684syr2k_64x64x8_1x1x1_3_t_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684her2k_objs.dir/generated/rank_2k/90/z1684her2k/cutlass_tensorop_z1684her2k_128x64x8_1x1x1_3_n_l_align1.cu.o [ 62%] Built target cutlass_library_rank_2k_sm90_gz1684syr2k_objs [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684syr2k_objs.dir/generated/rank_2k/90/z1684syr2k/all_sm90_z1684syr2k_rank_2k_operations.cu.o [ 62%] Built target cutlass_library_rank_2k_sm90_d1684syr2k_objs [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/all_sm80_c1688tf32trmm_trmm_operations.cu.o [ 62%] Built target cutlass_library_rank_2k_sm90_gz1684her2k_objs [ 62%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/all_sm80_c1688trmm_trmm_operations.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684syr2k_objs.dir/generated/rank_2k/90/z1684syr2k/cutlass_tensorop_z1684syr2k_128x64x8_1x1x1_3_n_l_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_nn_ls_l_nu_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_nn_ls_l_nu_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684her2k_objs.dir/generated/rank_2k/90/z1684her2k/cutlass_tensorop_z1684her2k_128x64x8_1x1x1_3_n_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_cn_ls_l_nu_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_cn_ls_l_nu_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684syr2k_objs.dir/generated/rank_2k/90/z1684syr2k/cutlass_tensorop_z1684syr2k_128x64x8_1x1x1_3_n_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_nn_ls_l_un_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684her2k_objs.dir/generated/rank_2k/90/z1684her2k/cutlass_tensorop_z1684her2k_128x64x8_1x1x1_3_h_l_align1.cu.o [ 63%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_nn_ls_l_un_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684syr2k_objs.dir/generated/rank_2k/90/z1684syr2k/cutlass_tensorop_z1684syr2k_128x64x8_1x1x1_3_t_l_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_cn_ls_l_un_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_cn_ls_l_un_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684her2k_objs.dir/generated/rank_2k/90/z1684her2k/cutlass_tensorop_z1684her2k_128x64x8_1x1x1_3_h_u_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684syr2k_objs.dir/generated/rank_2k/90/z1684syr2k/cutlass_tensorop_z1684syr2k_128x64x8_1x1x1_3_t_u_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_nn_ls_u_nu_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_nn_ls_u_nu_align1.cu.o [ 63%] Built target cutlass_library_rank_2k_sm90_z1684syr2k_objs [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/all_sm80_d884trmm_trmm_operations.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_cn_ls_u_nu_align1.cu.o [ 63%] Built target 
cutlass_library_rank_2k_sm90_z1684her2k_objs [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/all_sm80_gz884trmm_trmm_operations.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_nn_ls_l_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_cn_ls_u_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_nn_ls_l_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_nn_ls_u_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_nn_ls_l_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_cn_ls_l_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_nn_ls_u_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_cn_ls_u_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_nn_ls_l_un_align1.cu.o [ 64%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_nn_ls_u_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_cn_ls_u_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_nn_rs_l_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_cn_ls_l_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_nn_ls_u_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_nn_rs_l_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_cn_rs_l_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_nn_ls_u_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_nn_rs_l_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_nn_rs_l_un_align1.cu.o [ 64%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_cn_ls_u_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_cn_rs_l_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_nn_rs_l_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_nn_ls_u_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_cn_rs_l_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_nn_rs_l_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_nn_rs_u_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_cn_ls_u_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_nn_rs_u_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_cn_rs_l_un_align1.cu.o [ 64%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_nn_rs_u_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_nn_rs_l_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_cn_rs_u_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_nn_rs_u_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_tn_ls_l_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_cn_rs_l_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_nn_rs_u_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_nn_rs_l_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_tn_ls_l_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_cn_rs_u_nu_align1.cu.o [ 65%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_cn_rs_u_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_cn_rs_l_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_tn_ls_u_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_nn_rs_u_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_tn_ls_l_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_nn_rs_u_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_tn_ls_u_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_cn_rs_u_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_cn_rs_u_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_hn_ls_l_nu_align1.cu.o [ 65%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_tn_rs_l_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_nn_rs_u_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_tn_ls_l_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_tn_ls_l_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_tn_rs_l_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_cn_rs_u_un_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_hn_ls_l_nu_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_hn_ls_l_un_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_tn_rs_u_nu_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_tn_ls_l_nu_align1.cu.o [ 66%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_tn_ls_u_nu_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_tn_ls_l_un_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_tn_rs_u_un_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_hn_ls_l_nu_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_hn_ls_u_nu_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_hn_ls_l_un_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_tn_ls_l_un_align1.cu.o [ 66%] Built target cutlass_library_trmm_sm80_d884trmm_objs [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/all_sm80_s1688tf32trmm_trmm_operations.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_nn_ls_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_tn_ls_u_un_align1.cu.o [ 67%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_hn_ls_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_tn_ls_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_nn_ls_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_hn_ls_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_tn_ls_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_hn_ls_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_hn_ls_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_tn_rs_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_nn_ls_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_tn_ls_u_un_align1.cu.o [ 67%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_tn_ls_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_hn_rs_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_nn_ls_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_hn_ls_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_hn_ls_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_tn_rs_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_nn_rs_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_tn_rs_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_tn_rs_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_hn_rs_l_un_align1.cu.o [ 67%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_hn_rs_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_nn_rs_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_hn_rs_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_tn_rs_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_tn_rs_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_nn_rs_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_tn_rs_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_hn_rs_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_hn_rs_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_nn_rs_u_un_align1.cu.o [ 67%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_tn_rs_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_tn_rs_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_hn_rs_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_hn_rs_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_tn_ls_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_hn_rs_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_tn_rs_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_tn_rs_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_tn_ls_l_un_align1.cu.o [ 67%] Built target cutlass_library_trmm_sm80_c1688tf32trmm_objs [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/all_sm80_s1688trmm_trmm_operations.cu.o [ 67%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_hn_rs_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_hn_rs_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_nn_ls_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_tn_ls_u_nu_align1.cu.o [ 67%] Built target cutlass_library_trmm_sm80_gz884trmm_objs [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/all_sm80_z884trmm_trmm_operations.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_tn_rs_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_nn_ls_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_nn_ls_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_tn_ls_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_hn_rs_u_un_align1.cu.o [ 67%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_cn_ls_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_nn_ls_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_tn_rs_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_nn_ls_l_un_align1.cu.o [ 67%] Built target cutlass_library_trmm_sm80_c1688trmm_objs [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/all_sm90_d1684trmm_trmm_operations.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_nn_ls_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_nn_ls_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_tn_rs_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_cn_ls_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_nn_ls_l_un_align1.cu.o [ 68%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_nn_rs_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_nn_ls_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_tn_rs_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_nn_ls_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_nn_rs_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_cn_ls_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_tn_rs_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_nn_ls_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_nn_ls_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_nn_rs_u_nu_align1.cu.o [ 68%] Built target cutlass_library_trmm_sm80_s1688tf32trmm_objs [ 68%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/all_sm90_gz1684trmm_trmm_operations.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_nn_rs_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_cn_ls_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_nn_ls_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_nn_rs_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_nn_rs_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_nn_rs_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_cn_ls_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_tn_ls_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_nn_rs_u_nu_align1.cu.o [ 68%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_nn_ls_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_cn_rs_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_cn_ls_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_nn_rs_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_nn_rs_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_tn_ls_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_nn_ls_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_cn_rs_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_tn_ls_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_tn_ls_u_nu_align1.cu.o [ 68%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_cn_ls_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_nn_rs_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_tn_ls_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_tn_ls_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_nn_ls_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_cn_rs_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_tn_ls_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_cn_ls_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_tn_rs_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_nn_rs_u_un_align1.cu.o [ 68%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_tn_ls_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_nn_rs_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_cn_rs_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_tn_rs_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_tn_rs_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_cn_rs_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_tn_ls_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_tn_rs_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_nn_rs_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_tn_rs_u_nu_align1.cu.o [ 68%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_hn_ls_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_cn_rs_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_tn_rs_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_tn_rs_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_tn_ls_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_nn_rs_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_tn_rs_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_hn_ls_l_un_align1.cu.o [ 68%] Built target cutlass_library_trmm_sm80_s1688trmm_objs [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/all_sm90_z1684trmm_trmm_operations.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_cn_rs_u_nu_align1.cu.o [ 68%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_nn_ls_l_nu_align1.cu.o [ 68%] Built target cutlass_library_trmm_sm90_d1684trmm_objs [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688hemm_objs.dir/generated/symm/80/c1688hemm/all_sm80_c1688hemm_symm_operations.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688hemm_objs.dir/generated/symm/80/c1688hemm/cutlass_tensorop_c1688hemm_128x64_16x4_n_ls_l_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_tn_ls_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_nn_rs_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_cn_ls_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_hn_ls_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_cn_rs_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_nn_ls_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688hemm_objs.dir/generated/symm/80/c1688hemm/cutlass_tensorop_c1688hemm_128x64_16x4_n_ls_u_align1.cu.o [ 68%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_tn_ls_u_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_tn_ls_l_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_cn_ls_l_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_hn_ls_l_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_hn_ls_u_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_nn_ls_u_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688hemm_objs.dir/generated/symm/80/c1688hemm/cutlass_tensorop_c1688hemm_128x64_16x4_n_rs_l_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_tn_ls_l_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_tn_rs_l_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_cn_ls_u_nu_align1.cu.o [ 69%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_hn_ls_l_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_hn_rs_l_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688hemm_objs.dir/generated/symm/80/c1688hemm/cutlass_tensorop_c1688hemm_128x64_16x4_n_rs_u_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_nn_ls_u_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_tn_ls_u_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_tn_rs_l_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_cn_ls_u_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_hn_ls_u_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_hn_rs_l_un_align1.cu.o [ 69%] Built target cutlass_library_symm_sm80_c1688hemm_objs [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688symm_objs.dir/generated/symm/80/c1688symm/all_sm80_c1688symm_symm_operations.cu.o [ 69%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_nn_rs_l_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688symm_objs.dir/generated/symm/80/c1688symm/cutlass_tensorop_c1688symm_128x64_16x4_n_ls_l_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_tn_ls_u_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_tn_rs_u_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_cn_rs_l_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_hn_ls_u_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_hn_rs_u_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_nn_rs_l_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688symm_objs.dir/generated/symm/80/c1688symm/cutlass_tensorop_c1688symm_128x64_16x4_n_ls_u_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_tn_rs_l_nu_align1.cu.o [ 69%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_tn_rs_u_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_cn_rs_l_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_hn_rs_l_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_hn_rs_u_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688symm_objs.dir/generated/symm/80/c1688symm/cutlass_tensorop_c1688symm_128x64_16x4_n_rs_l_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_nn_rs_u_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_tn_rs_l_un_align1.cu.o [ 69%] Built target cutlass_library_trmm_sm80_z884trmm_objs [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32hemm_objs.dir/generated/symm/80/c1688tf32hemm/all_sm80_c1688tf32hemm_symm_operations.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_cn_rs_u_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_hn_rs_l_un_align1.cu.o [ 69%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32hemm_objs.dir/generated/symm/80/c1688tf32hemm/cutlass_tensorop_c1688tf32hemm_128x64_16x4_n_ls_l_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688symm_objs.dir/generated/symm/80/c1688symm/cutlass_tensorop_c1688symm_128x64_16x4_n_rs_u_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_nn_rs_u_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_tn_rs_u_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32hemm_objs.dir/generated/symm/80/c1688tf32hemm/cutlass_tensorop_c1688tf32hemm_128x64_16x4_n_ls_u_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_hn_rs_u_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_cn_rs_u_un_align1.cu.o [ 69%] Built target cutlass_library_symm_sm80_c1688symm_objs [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32symm_objs.dir/generated/symm/80/c1688tf32symm/all_sm80_c1688tf32symm_symm_operations.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_tn_rs_u_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32symm_objs.dir/generated/symm/80/c1688tf32symm/cutlass_tensorop_c1688tf32symm_128x64_16x4_n_ls_l_align1.cu.o [ 69%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_tn_ls_l_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32hemm_objs.dir/generated/symm/80/c1688tf32hemm/cutlass_tensorop_c1688tf32hemm_128x64_16x4_n_rs_l_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_hn_rs_u_un_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_hn_ls_l_nu_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32symm_objs.dir/generated/symm/80/c1688tf32symm/cutlass_tensorop_c1688tf32symm_128x64_16x4_n_ls_u_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32hemm_objs.dir/generated/symm/80/c1688tf32hemm/cutlass_tensorop_c1688tf32hemm_128x64_16x4_n_rs_u_align1.cu.o [ 70%] Built target cutlass_library_trmm_sm90_gz1684trmm_objs [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_d884symm_objs.dir/generated/symm/80/d884symm/all_sm80_d884symm_symm_operations.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_tn_ls_l_un_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_d884symm_objs.dir/generated/symm/80/d884symm/cutlass_tensorop_d884symm_128x128_16x3_n_ls_l_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32symm_objs.dir/generated/symm/80/c1688tf32symm/cutlass_tensorop_c1688tf32symm_128x64_16x4_n_rs_l_align1.cu.o [ 70%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_hn_ls_l_un_align1.cu.o [ 70%] Built target cutlass_library_symm_sm80_c1688tf32hemm_objs [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884hemm_objs.dir/generated/symm/80/gz884hemm/all_sm80_gz884hemm_symm_operations.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884hemm_objs.dir/generated/symm/80/gz884hemm/cutlass_tensorop_gz884hemm_64x64_8x3_n_ls_l_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_d884symm_objs.dir/generated/symm/80/d884symm/cutlass_tensorop_d884symm_128x128_16x3_n_ls_u_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_tn_ls_u_nu_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32symm_objs.dir/generated/symm/80/c1688tf32symm/cutlass_tensorop_c1688tf32symm_128x64_16x4_n_rs_u_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884hemm_objs.dir/generated/symm/80/gz884hemm/cutlass_tensorop_gz884hemm_64x64_8x3_n_ls_u_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_hn_ls_u_nu_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_d884symm_objs.dir/generated/symm/80/d884symm/cutlass_tensorop_d884symm_128x128_16x3_n_rs_l_align1.cu.o [ 70%] Built target cutlass_library_symm_sm80_c1688tf32symm_objs [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884symm_objs.dir/generated/symm/80/gz884symm/all_sm80_gz884symm_symm_operations.cu.o [ 70%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884hemm_objs.dir/generated/symm/80/gz884hemm/cutlass_tensorop_gz884hemm_64x64_8x3_n_rs_l_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_tn_ls_u_un_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884symm_objs.dir/generated/symm/80/gz884symm/cutlass_tensorop_gz884symm_64x64_8x3_n_ls_l_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_d884symm_objs.dir/generated/symm/80/d884symm/cutlass_tensorop_d884symm_128x128_16x3_n_rs_u_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_hn_ls_u_un_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884hemm_objs.dir/generated/symm/80/gz884hemm/cutlass_tensorop_gz884hemm_64x64_8x3_n_rs_u_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884symm_objs.dir/generated/symm/80/gz884symm/cutlass_tensorop_gz884symm_64x64_8x3_n_ls_u_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_tn_rs_l_nu_align1.cu.o [ 70%] Built target cutlass_library_symm_sm80_d884symm_objs [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688symm_objs.dir/generated/symm/80/s1688symm/all_sm80_s1688symm_symm_operations.cu.o [ 70%] Built target cutlass_library_symm_sm80_gz884hemm_objs [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688tf32symm_objs.dir/generated/symm/80/s1688tf32symm/all_sm80_s1688tf32symm_symm_operations.cu.o [ 71%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884symm_objs.dir/generated/symm/80/gz884symm/cutlass_tensorop_gz884symm_64x64_8x3_n_rs_l_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688symm_objs.dir/generated/symm/80/s1688symm/cutlass_tensorop_s1688symm_256x128_16x3_n_ls_l_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688tf32symm_objs.dir/generated/symm/80/s1688tf32symm/cutlass_tensorop_s1688tf32symm_256x128_16x3_n_ls_l_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_hn_rs_l_nu_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884symm_objs.dir/generated/symm/80/gz884symm/cutlass_tensorop_gz884symm_64x64_8x3_n_rs_u_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_tn_rs_l_un_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688symm_objs.dir/generated/symm/80/s1688symm/cutlass_tensorop_s1688symm_256x128_16x3_n_ls_u_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688tf32symm_objs.dir/generated/symm/80/s1688tf32symm/cutlass_tensorop_s1688tf32symm_256x128_16x3_n_ls_u_align1.cu.o [ 71%] Built target cutlass_library_symm_sm80_gz884symm_objs [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_z884hemm_objs.dir/generated/symm/80/z884hemm/all_sm80_z884hemm_symm_operations.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_hn_rs_l_un_align1.cu.o [ 71%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_symm_sm80_z884hemm_objs.dir/generated/symm/80/z884hemm/cutlass_tensorop_z884hemm_128x64_8x3_n_ls_l_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_tn_rs_u_nu_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688tf32symm_objs.dir/generated/symm/80/s1688tf32symm/cutlass_tensorop_s1688tf32symm_256x128_16x3_n_rs_l_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688symm_objs.dir/generated/symm/80/s1688symm/cutlass_tensorop_s1688symm_256x128_16x3_n_rs_l_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_z884hemm_objs.dir/generated/symm/80/z884hemm/cutlass_tensorop_z884hemm_128x64_8x3_n_ls_u_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_hn_rs_u_nu_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_z884hemm_objs.dir/generated/symm/80/z884hemm/cutlass_tensorop_z884hemm_128x64_8x3_n_rs_l_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_tn_rs_u_un_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688tf32symm_objs.dir/generated/symm/80/s1688tf32symm/cutlass_tensorop_s1688tf32symm_256x128_16x3_n_rs_u_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688symm_objs.dir/generated/symm/80/s1688symm/cutlass_tensorop_s1688symm_256x128_16x3_n_rs_u_align1.cu.o [ 71%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_hn_rs_u_un_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_z884hemm_objs.dir/generated/symm/80/z884hemm/cutlass_tensorop_z884hemm_128x64_8x3_n_rs_u_align1.cu.o [ 71%] Built target cutlass_library_symm_sm80_s1688tf32symm_objs [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_z884symm_objs.dir/generated/symm/80/z884symm/all_sm80_z884symm_symm_operations.cu.o [ 71%] Built target cutlass_library_trmm_sm90_z1684trmm_objs [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_d1684symm_objs.dir/generated/symm/90/d1684symm/all_sm90_d1684symm_symm_operations.cu.o [ 71%] Built target cutlass_library_symm_sm80_s1688symm_objs [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684hemm_objs.dir/generated/symm/90/gz1684hemm/all_sm90_gz1684hemm_symm_operations.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_z884symm_objs.dir/generated/symm/80/z884symm/cutlass_tensorop_z884symm_128x64_8x3_n_ls_l_align1.cu.o [ 71%] Built target cutlass_library_symm_sm80_z884hemm_objs [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684symm_objs.dir/generated/symm/90/gz1684symm/all_sm90_gz1684symm_symm_operations.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_d1684symm_objs.dir/generated/symm/90/d1684symm/cutlass_tensorop_d1684symm_128x128x16_1x1x1_3_n_ls_l_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684hemm_objs.dir/generated/symm/90/gz1684hemm/cutlass_tensorop_gz1684hemm_64x64x8_1x1x1_3_n_ls_l_align1.cu.o [ 72%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684symm_objs.dir/generated/symm/90/gz1684symm/cutlass_tensorop_gz1684symm_64x64x8_1x1x1_3_n_ls_l_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_z884symm_objs.dir/generated/symm/80/z884symm/cutlass_tensorop_z884symm_128x64_8x3_n_ls_u_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684hemm_objs.dir/generated/symm/90/gz1684hemm/cutlass_tensorop_gz1684hemm_64x64x8_1x1x1_3_n_ls_u_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_d1684symm_objs.dir/generated/symm/90/d1684symm/cutlass_tensorop_d1684symm_128x128x16_1x1x1_3_n_ls_u_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684symm_objs.dir/generated/symm/90/gz1684symm/cutlass_tensorop_gz1684symm_64x64x8_1x1x1_3_n_ls_u_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_z884symm_objs.dir/generated/symm/80/z884symm/cutlass_tensorop_z884symm_128x64_8x3_n_rs_l_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684hemm_objs.dir/generated/symm/90/gz1684hemm/cutlass_tensorop_gz1684hemm_64x64x8_1x1x1_3_n_rs_l_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684symm_objs.dir/generated/symm/90/gz1684symm/cutlass_tensorop_gz1684symm_64x64x8_1x1x1_3_n_rs_l_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_d1684symm_objs.dir/generated/symm/90/d1684symm/cutlass_tensorop_d1684symm_128x128x16_1x1x1_3_n_rs_l_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_z884symm_objs.dir/generated/symm/80/z884symm/cutlass_tensorop_z884symm_128x64_8x3_n_rs_u_align1.cu.o [ 72%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684hemm_objs.dir/generated/symm/90/gz1684hemm/cutlass_tensorop_gz1684hemm_64x64x8_1x1x1_3_n_rs_u_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684symm_objs.dir/generated/symm/90/gz1684symm/cutlass_tensorop_gz1684symm_64x64x8_1x1x1_3_n_rs_u_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_d1684symm_objs.dir/generated/symm/90/d1684symm/cutlass_tensorop_d1684symm_128x128x16_1x1x1_3_n_rs_u_align1.cu.o [ 72%] Built target cutlass_library_symm_sm90_gz1684hemm_objs [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684hemm_objs.dir/generated/symm/90/z1684hemm/all_sm90_z1684hemm_symm_operations.cu.o [ 72%] Built target cutlass_library_symm_sm90_gz1684symm_objs [ 72%] Linking CUDA static library libcutlass_symm_sm90_z1684symm.a [ 72%] Built target cutlass_library_symm_sm90_z1684symm_static [ 72%] Linking CUDA static library libcutlass_gemm_sm50_cgemm.a [ 72%] Built target cutlass_library_gemm_sm50_cgemm_static [ 72%] Linking CUDA static library libcutlass_gemm_sm50_dgemm.a [ 72%] Built target cutlass_library_gemm_sm50_dgemm_static [ 72%] Linking CUDA static library libcutlass_gemm_sm50_sgemm.a [ 72%] Built target cutlass_library_gemm_sm50_sgemm_static [ 72%] Linking CUDA static library libcutlass_gemm_sm60_hgemm.a [ 72%] Built target cutlass_library_gemm_sm60_hgemm_static [ 72%] Linking CUDA static library libcutlass_gemm_sm61_igemm_s8.a [ 72%] Built target cutlass_library_gemm_sm61_igemm_s8_static [ 72%] Linking CUDA static library libcutlass_gemm_sm61_s8_igemm_s8.a [ 72%] Built target cutlass_library_gemm_sm61_s8_igemm_s8_static [ 72%] Linking CUDA static library libcutlass_gemm_sm70_f16_s884gemm_f16.a [ 72%] Built target cutlass_library_gemm_sm70_f16_s884gemm_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm70_f16_s884gemm_planar_complex_array_f16.a [ 72%] Built target 
cutlass_library_symm_sm80_z884symm_objs [ 72%] Linking CUDA static library libcutlass_gemm_sm70_f16_s884gemm_planar_complex_f16.a [ 72%] Built target cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm70_h884gemm.a [ 72%] Built target cutlass_library_gemm_sm70_h884gemm_static [ 72%] Linking CUDA static library libcutlass_gemm_sm70_h884gemm_planar_complex.a [ 72%] Built target cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm70_h884gemm_planar_complex_array.a [ 72%] Built target cutlass_library_gemm_sm70_h884gemm_planar_complex_static [ 72%] Linking CUDA static library libcutlass_gemm_sm70_s884gemm_f16.a [ 72%] Built target cutlass_library_gemm_sm70_h884gemm_planar_complex_array_static [ 72%] Built target cutlass_library_gemm_sm70_s884gemm_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm70_s884gemm_planar_complex_array_f16.a [ 72%] Linking CUDA static library libcutlass_gemm_sm70_s884gemm_planar_complex_f16.a [ 72%] Built target cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_f16_s1688gemm_f16.a [ 72%] Built target cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_array_f16.a [ 72%] Built target cutlass_library_gemm_sm75_f16_s1688gemm_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_f16.a [ 72%] Built target cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_h1688gemm.a [ 72%] Built target cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_h1688gemm_planar_complex.a [ 72%] Built target cutlass_library_gemm_sm75_h1688gemm_static [ 72%] Linking CUDA 
static library libcutlass_gemm_sm75_h1688gemm_planar_complex_array.a [ 72%] Built target cutlass_library_gemm_sm75_h1688gemm_planar_complex_static [ 72%] Built target cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_i88128xorgemm_b1.a [ 72%] Linking CUDA static library libcutlass_gemm_sm75_i8816gemm_s8.a [ 72%] Built target cutlass_library_gemm_sm75_i88128xorgemm_b1_static [ 72%] Built target cutlass_library_gemm_sm75_i8816gemm_s8_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_i8816gemm_u8.a [ 72%] Linking CUDA static library libcutlass_gemm_sm75_i8832gemm_s4.a [ 72%] Built target cutlass_library_gemm_sm75_i8816gemm_u8_static [ 72%] Built target cutlass_library_gemm_sm75_i8832gemm_s4_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_i8832gemm_u4.a [ 72%] Linking CUDA static library libcutlass_gemm_sm75_s1688gemm_f16.a [ 72%] Built target cutlass_library_gemm_sm75_i8832gemm_u4_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_s1688gemm_planar_complex_array_f16.a [ 72%] Built target cutlass_library_gemm_sm75_s1688gemm_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_s1688gemm_planar_complex_f16.a [ 72%] Built target cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_static [ 73%] Linking CUDA static library libcutlass_gemm_sm75_s4_i8832gemm_s4.a [ 73%] Built target cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_static [ 73%] Linking CUDA static library libcutlass_gemm_sm75_s8_i8816gemm_s8.a [ 73%] Built target cutlass_library_gemm_sm75_s4_i8832gemm_s4_static [ 73%] Linking CUDA static library libcutlass_gemm_sm75_u4_i8832gemm_u4.a [ 73%] Built target cutlass_library_gemm_sm75_s8_i8816gemm_s8_static [ 73%] Linking CUDA static library libcutlass_gemm_sm75_u8_i8816gemm_u8.a [ 73%] Built target cutlass_library_gemm_sm75_u4_i8832gemm_u4_static [ 73%] Linking CUDA static library libcutlass_gemm_sm80_bf16_s16816gemm_bf16.a [ 
73%] Built target cutlass_library_gemm_sm75_u8_i8816gemm_u8_static [ 73%] Linking CUDA static library libcutlass_gemm_sm80_bf16_s16816gemm_bf16_s8.a [ 73%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_s8_static [ 73%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_static [ 73%] Linking CUDA static library libcutlass_gemm_sm80_bf16_s16816gemm_bf16_u8.a [ 73%] Linking CUDA static library libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16.a [ 73%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_u8_static [ 73%] Linking CUDA static library libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_bf16.a [ 73%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_bf16_s16816gemm_s8_bf16.a [ 74%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_bf16_s16816gemm_u8_bf16.a [ 74%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_s8_bf16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_bf16_s16832spgemm_bf16.a [ 74%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_u8_bf16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_c1688gemm.a [ 74%] Built target cutlass_library_gemm_sm80_bf16_s16832spgemm_bf16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_c1688tf32gemm.a [ 74%] Built target cutlass_library_gemm_sm80_c1688gemm_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_cgemm.a [ 74%] Built target cutlass_library_gemm_sm80_c1688tf32gemm_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_d884gemm.a [ 74%] Built target cutlass_library_gemm_sm80_d884gemm_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_dgemm.a [ 74%] Built target cutlass_library_gemm_sm80_dgemm_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_f16_s16816gemm_f16.a [ 74%] Built target 
cutlass_library_gemm_sm80_f16_s16816gemm_f16_static [ 74%] Built target cutlass_library_gemm_sm80_cgemm_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_f16_s16816gemm_f16_s8.a [ 74%] Linking CUDA static library libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2.a [ 74%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_f16_s8_static [ 74%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_f16_s16816gemm_f16_u8.a [ 74%] Linking CUDA static library libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_array_f16.a [ 74%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_f16_u8_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_f16.a [ 74%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_f16_s16816gemm_s8_f16.a [ 74%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_static [ 74%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_s8_f16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_f16_s16816gemm_u8_f16.a [ 74%] Linking CUDA static library libcutlass_gemm_sm80_f16_s16832spgemm_f16.a [ 74%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_u8_f16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3.a [ 74%] Built target cutlass_library_gemm_sm80_f16_s16832spgemm_f16_static [ 74%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_gz884gemm.a [ 74%] Linking CUDA static library libcutlass_gemm_sm80_h16816gemm.a [ 74%] Built target cutlass_library_gemm_sm80_h16816gemm_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_h16816gemm_grouped.a [ 74%] Built target cutlass_library_gemm_sm80_gz884gemm_static [ 74%] Linking CUDA static library 
libcutlass_gemm_sm80_h16816gemm_planar_complex.a [ 74%] Built target cutlass_library_gemm_sm80_h16816gemm_grouped_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_h16816gemm_planar_complex_array.a [ 74%] Built target cutlass_library_gemm_sm80_h16816gemm_planar_complex_static [ 74%] Built target cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_h16816gemm_s8_f16.a [ 74%] Linking CUDA static library libcutlass_gemm_sm80_h16832spgemm.a [ 74%] Built target cutlass_library_gemm_sm80_h16816gemm_s8_f16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_i168128spgemm_s4.a [ 74%] Built target cutlass_library_gemm_sm80_h16832spgemm_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_i168256andgemm_b1.a [ 74%] Built target cutlass_library_gemm_sm80_i168128spgemm_s4_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_i168256xorgemm_b1.a [ 74%] Built target cutlass_library_gemm_sm80_i168256andgemm_b1_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_i16832gemm_s8.a [ 74%] Built target cutlass_library_gemm_sm80_i168256xorgemm_b1_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_i16832gemm_u8.a [ 74%] Built target cutlass_library_gemm_sm80_i16832gemm_s8_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_i16864gemm_s4.a [ 74%] Built target cutlass_library_gemm_sm80_i16832gemm_u8_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_i16864gemm_u4.a [ 74%] Built target cutlass_library_gemm_sm80_i16864gemm_s4_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_i16864spgemm_s8.a [ 74%] Built target cutlass_library_gemm_sm80_i16864gemm_u4_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_bf16.a [ 74%] Built target cutlass_library_gemm_sm80_i16864spgemm_s8_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_bf16_s8.a [ 74%] Built target 
cutlass_library_gemm_sm80_s16816gemm_bf16_s8_static [ 74%] Built target cutlass_library_gemm_sm80_s16816gemm_bf16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_bf16_u8.a [ 74%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_f16.a [ 74%] Built target cutlass_library_gemm_sm80_s16816gemm_bf16_u8_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_f16_s8.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_f16_static [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_f16_s8_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_f16_u8.a [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_grouped_bf16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_f16_u8_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_grouped_f16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_grouped_bf16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_planar_complex_array_bf16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_grouped_f16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_planar_complex_array_f16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_planar_complex_bf16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_planar_complex_f16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_static [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_s8_bf16.a [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_s8_f16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_s8_f16_static [ 75%] Built target 
cutlass_library_gemm_sm80_s16816gemm_s8_bf16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_u8_bf16.a [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_u8_f16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_u8_f16_static [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_u8_bf16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816tf32spgemm.a [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16832spgemm_bf16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816tf32spgemm_static [ 75%] Built target cutlass_library_gemm_sm80_s16832spgemm_bf16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16832spgemm_f16.a [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s1688bf16gemm.a [ 75%] Built target cutlass_library_gemm_sm80_s16832spgemm_f16_static [ 75%] Built target cutlass_library_gemm_sm80_s1688bf16gemm_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s1688f16gemm.a [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s1688gemm.a [ 75%] Built target cutlass_library_gemm_sm80_s1688f16gemm_static [ 75%] Built target cutlass_library_gemm_sm80_s1688gemm_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s1688gemm_tf32.a [ 75%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684hemm_objs.dir/generated/symm/90/z1684hemm/cutlass_tensorop_z1684hemm_128x64x8_1x1x1_3_n_ls_l_align1.cu.o [ 75%] Built target cutlass_library_gemm_sm80_s1688gemm_tf32_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s1688tf32gemm.a [ 75%] Built target cutlass_library_gemm_sm80_s1688tf32gemm_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s4_i168128spgemm_s4.a [ 75%] Built target cutlass_library_gemm_sm80_s4_i168128spgemm_s4_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s4_i16864gemm_s4.a [ 75%] Built target cutlass_library_gemm_sm80_s4_i16864gemm_s4_static [ 76%] Linking CUDA static library 
libcutlass_gemm_sm80_s8_i16832gemm_s8.a [ 76%] Built target cutlass_library_gemm_sm80_s8_i16832gemm_s8_static [ 76%] Linking CUDA static library libcutlass_gemm_sm80_s8_i16864spgemm_s8.a [ 76%] Built target cutlass_library_gemm_sm80_s8_i16864spgemm_s8_static [ 76%] Linking CUDA static library libcutlass_gemm_sm80_sgemm.a [ 76%] Built target cutlass_library_gemm_sm80_sgemm_static [ 76%] Linking CUDA static library libcutlass_gemm_sm80_tf32_s1688gemm_tf32.a [ 76%] Built target cutlass_library_gemm_sm80_tf32_s1688gemm_tf32_static [ 76%] Linking CUDA static library libcutlass_gemm_sm80_u4_i16864gemm_u4.a [ 76%] Built target cutlass_library_gemm_sm80_u4_i16864gemm_u4_static [ 76%] Linking CUDA static library libcutlass_gemm_sm80_u8_i16832gemm_u8.a [ 76%] Built target cutlass_library_gemm_sm80_u8_i16832gemm_u8_static [ 76%] Linking CUDA static library libcutlass_gemm_sm80_z884gemm.a [ 76%] Built target cutlass_library_gemm_sm80_z884gemm_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3.a [ 76%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2.a [ 76%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2_static [ 76%] Linking CUDA static library libcutlass_conv2d_sm80_s1688f16fprop_optimized.a [ 76%] Built target cutlass_library_conv2d_sm80_s1688f16fprop_optimized_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3.a [ 76%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3_static [ 76%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684hemm_objs.dir/generated/symm/90/z1684hemm/cutlass_tensorop_z1684hemm_128x64x8_1x1x1_3_n_ls_u_align1.cu.o [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16832gemm_e4m3.a [ 76%] Built target cutlass_library_gemm_sm89_s16832gemm_e4m3_static [ 76%] Linking CUDA static library 
libcutlass_gemm_sm89_s16832gemm_e4m3_e5m2.a [ 76%] Built target cutlass_library_gemm_sm89_s16832gemm_e4m3_e5m2_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16832gemm_e5m2.a [ 76%] Built target cutlass_library_gemm_sm89_s16832gemm_e5m2_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16832gemm_e5m2_e4m3.a [ 76%] Built target cutlass_library_gemm_sm89_s16832gemm_e5m2_e4m3_static [ 76%] Linking CUDA static library libcutlass_conv2d_sm80_s1688tf32wgrad_optimized.a [ 76%] Built target cutlass_library_conv2d_sm80_s1688tf32wgrad_optimized_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2.a [ 76%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2.a [ 76%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3.a [ 76%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16864spgemm_e4m3.a [ 76%] Built target cutlass_library_gemm_sm89_s16864spgemm_e4m3_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16864spgemm_e4m3_e5m2.a [ 76%] Built target cutlass_library_gemm_sm89_s16864spgemm_e4m3_e5m2_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16864spgemm_e5m2.a [ 76%] Built target cutlass_library_gemm_sm89_s16864spgemm_e5m2_static [ 77%] Linking CUDA static library libcutlass_gemm_sm89_s16864spgemm_e5m2_e4m3.a [ 77%] Built target cutlass_library_gemm_sm89_s16864spgemm_e5m2_e4m3_static [ 77%] Linking CUDA static library libcutlass_gemm_sm90_bf16_s64x128x16gemm_bf16.a [ 77%] Built target cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_static [ 77%] Linking CUDA static library libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3.a [ 77%] Built target 
cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_static [ 77%] Linking CUDA static library libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2.a [ 77%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_static [ 77%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684hemm_objs.dir/generated/symm/90/z1684hemm/cutlass_tensorop_z1684hemm_128x64x8_1x1x1_3_n_rs_l_align1.cu.o [ 77%] Built target cutlass_library_symm_sm90_d1684symm_objs [ 77%] Linking CUDA static library libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2.a [ 77%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_static [ 77%] Linking CUDA static library libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3.a [ 77%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_static [ 77%] Linking CUDA static library libcutlass_gemm_sm90_d1684gemm.a [ 77%] Built target cutlass_library_gemm_sm90_d1684gemm_static [ 77%] Linking CUDA static library libcutlass_gemm_sm90_f16_s64x128x16gemm_f16.a [ 77%] Built target cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_static [ 78%] Linking CUDA static library libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3.a [ 78%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_static [ 78%] Linking CUDA static library libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2.a [ 78%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_static [ 78%] Linking CUDA static library libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2.a [ 78%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_static [ 78%] Linking CUDA static library libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3.a [ 78%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_gz1684gemm.a [ 79%] Built target cutlass_library_gemm_sm90_gz1684gemm_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_h64x128x16gemm.a [ 79%] Built target 
cutlass_library_gemm_sm90_h64x128x16gemm_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_i64x128x32gemm_s8.a [ 79%] Built target cutlass_library_gemm_sm90_i64x128x32gemm_s8_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_i64x128x32gemm_u8.a [ 79%] Built target cutlass_library_gemm_sm90_i64x128x32gemm_u8_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s64x128x16gemm_bf16.a [ 79%] Built target cutlass_library_gemm_sm90_s64x128x16gemm_bf16_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s64x128x16gemm_f16.a [ 79%] Built target cutlass_library_gemm_sm90_s64x128x16gemm_f16_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s64x128x32gemm_e4m3.a [ 79%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s64x128x32gemm_e4m3_e5m2.a [ 79%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s64x128x32gemm_e5m2.a [ 79%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s64x128x32gemm_e5m2_e4m3.a [ 79%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s64x128x8gemm_tf32.a [ 79%] Built target cutlass_library_gemm_sm90_s64x128x8gemm_tf32_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s64x128x8tf32gemm.a [ 79%] Built target cutlass_library_gemm_sm90_s64x128x8tf32gemm_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s8_i64x128x32gemm_s8.a [ 79%] Built target cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s8_i64x128x32gemm_u8.a [ 79%] Built target cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_void_i64x128x32gemm_s8.a [ 79%] Built target 
cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_void_i64x128x32gemm_u8.a [ 79%] Built target cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_void_s64x128x16gemm_bf16.a [ 79%] Built target cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_void_s64x128x16gemm_f16.a [ 79%] Built target cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3.a [ 79%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2.a [ 79%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2.a [ 79%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3.a [ 79%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_z1684gemm.a [ 79%] Built target cutlass_library_gemm_sm90_z1684gemm_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm50_cf32_cdgrad_optimized_cf32.a [ 79%] Built target cutlass_library_conv2d_sm50_cf32_cdgrad_optimized_cf32_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm50_cf32_cfprop_optimized_cf32.a [ 79%] Built target cutlass_library_conv2d_sm50_cf32_cfprop_optimized_cf32_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm50_cf32_cwgrad_optimized_cf32.a [ 79%] Built target cutlass_library_conv2d_sm50_cf32_cwgrad_optimized_cf32_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm50_sdgrad_optimized.a [ 79%] Built target cutlass_library_conv2d_sm50_sdgrad_optimized_static [ 79%] Linking 
CUDA static library libcutlass_conv2d_sm50_sfprop_optimized.a [ 79%] Built target cutlass_library_conv2d_sm50_sfprop_optimized_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm50_swgrad_optimized.a [ 79%] Built target cutlass_library_conv2d_sm50_swgrad_optimized_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm60_hfprop_optimized.a [ 79%] Built target cutlass_library_conv2d_sm60_hfprop_optimized_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm70_f16_s884dgrad_optimized_f16.a [ 79%] Built target cutlass_library_conv2d_sm70_f16_s884dgrad_optimized_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm70_f16_s884fprop_optimized_f16.a [ 79%] Built target cutlass_library_conv2d_sm70_f16_s884fprop_optimized_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm70_f16_s884wgrad_optimized_f16.a [ 79%] Built target cutlass_library_conv2d_sm70_f16_s884wgrad_optimized_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm70_h884dgrad_optimized.a [ 79%] Built target cutlass_library_conv2d_sm70_h884dgrad_optimized_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm70_h884fprop_optimized.a [ 79%] Built target cutlass_library_conv2d_sm70_h884fprop_optimized_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm70_h884wgrad_optimized.a [ 79%] Built target cutlass_library_conv2d_sm70_h884wgrad_optimized_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm70_s884dgrad_optimized_f16.a [ 79%] Built target cutlass_library_conv2d_sm70_s884dgrad_optimized_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm70_s884fprop_optimized_f16.a [ 79%] Built target cutlass_library_conv2d_sm70_s884fprop_optimized_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm70_s884wgrad_optimized_f16.a [ 79%] Built target cutlass_library_conv2d_sm70_s884wgrad_optimized_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm75_cf32_cdgrad_optimized_cf32.a [ 79%] Built 
target cutlass_library_conv2d_sm75_cf32_cdgrad_optimized_cf32_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm75_cf32_cfprop_optimized_cf32.a [ 79%] Built target cutlass_library_conv2d_sm75_cf32_cfprop_optimized_cf32_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm75_cf32_cwgrad_optimized_cf32.a [ 79%] Built target cutlass_library_conv2d_sm75_cf32_cwgrad_optimized_cf32_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm75_f16_s1688dgrad_optimized_f16.a [ 79%] Built target cutlass_library_conv2d_sm75_f16_s1688dgrad_optimized_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm75_f16_s1688fprop_few_channels_f16.a [ 79%] Built target cutlass_library_conv2d_sm75_f16_s1688fprop_few_channels_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm75_f16_s1688fprop_fixed_channels_f16.a [ 79%] Built target cutlass_library_conv2d_sm75_f16_s1688fprop_fixed_channels_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm75_f16_s1688fprop_optimized_f16.a [ 79%] Built target cutlass_library_conv2d_sm75_f16_s1688fprop_optimized_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm75_f16_s1688wgrad_optimized_f16.a [ 79%] Built target cutlass_library_conv2d_sm75_f16_s1688wgrad_optimized_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_h1688dgrad_optimized.a [ 80%] Built target cutlass_library_conv2d_sm75_h1688dgrad_optimized_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_h1688fprop_few_channels.a [ 80%] Built target cutlass_library_conv2d_sm75_h1688fprop_few_channels_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_h1688fprop_fixed_channels.a [ 80%] Built target cutlass_library_conv2d_sm75_h1688fprop_fixed_channels_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_h1688fprop_optimized.a [ 80%] Built target cutlass_library_conv2d_sm75_h1688fprop_optimized_static [ 80%] Linking CUDA static library 
libcutlass_conv2d_sm75_h1688wgrad_optimized.a [ 80%] Built target cutlass_library_conv2d_sm75_h1688wgrad_optimized_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_i8816fprop_optimized_s8.a [ 80%] Built target cutlass_library_conv2d_sm75_i8816fprop_optimized_s8_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_i8816fprop_optimized_u8.a [ 80%] Built target cutlass_library_conv2d_sm75_i8816fprop_optimized_u8_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_i8832fprop_optimized_s4.a [ 80%] Built target cutlass_library_conv2d_sm75_i8832fprop_optimized_s4_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_i8832fprop_optimized_u4.a [ 80%] Built target cutlass_library_conv2d_sm75_i8832fprop_optimized_u4_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_s1688dgrad_optimized_f16.a [ 80%] Built target cutlass_library_conv2d_sm75_s1688dgrad_optimized_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_s1688fprop_few_channels_f16.a [ 80%] Built target cutlass_library_conv2d_sm75_s1688fprop_few_channels_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_s1688fprop_fixed_channels_f16.a [ 80%] Built target cutlass_library_conv2d_sm75_s1688fprop_fixed_channels_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_s1688fprop_optimized_f16.a [ 80%] Built target cutlass_library_conv2d_sm75_s1688fprop_optimized_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_s1688wgrad_optimized_f16.a [ 80%] Built target cutlass_library_conv2d_sm75_s1688wgrad_optimized_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_s4_i8832fprop_optimized_s4.a [ 80%] Built target cutlass_library_conv2d_sm75_s4_i8832fprop_optimized_s4_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_s8_i8816fprop_few_channels_s8.a [ 80%] Built target cutlass_library_conv2d_sm75_s8_i8816fprop_few_channels_s8_static [ 80%] Linking CUDA static library 
libcutlass_conv2d_sm75_s8_i8816fprop_fixed_channels_s8.a [ 80%] Built target cutlass_library_conv2d_sm75_s8_i8816fprop_fixed_channels_s8_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_s8_i8816fprop_optimized_s8.a [ 80%] Built target cutlass_library_conv2d_sm75_s8_i8816fprop_optimized_s8_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_u4_i8832fprop_optimized_u4.a [ 80%] Built target cutlass_library_conv2d_sm75_u4_i8832fprop_optimized_u4_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_u8_i8816fprop_few_channels_u8.a [ 80%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_few_channels_u8_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_u8_i8816fprop_fixed_channels_u8.a [ 80%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_fixed_channels_u8_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_u8_i8816fprop_optimized_u8.a [ 80%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_optimized_u8_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_bf16_s16816dgrad_optimized_bf16.a [ 80%] Built target cutlass_library_conv2d_sm80_bf16_s16816dgrad_optimized_bf16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16.a [ 80%] Built target cutlass_library_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_bf16_s16816fprop_optimized_bf16.a [ 80%] Built target cutlass_library_conv2d_sm80_bf16_s16816fprop_optimized_bf16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_bf16_s16816wgrad_optimized_bf16.a [ 80%] Built target cutlass_library_conv2d_sm80_bf16_s16816wgrad_optimized_bf16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_f16_s16816dgrad_optimized_f16.a [ 80%] Built target cutlass_library_conv2d_sm80_f16_s16816dgrad_optimized_f16_static [ 80%] Linking CUDA static library 
libcutlass_conv2d_sm80_f16_s16816fprop_fixed_channels_f16.a [ 80%] Built target cutlass_library_conv2d_sm80_f16_s16816fprop_fixed_channels_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_f16_s16816fprop_optimized_f16.a [ 80%] Built target cutlass_library_conv2d_sm80_f16_s16816fprop_optimized_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_f16_s16816wgrad_optimized_f16.a [ 80%] Built target cutlass_library_conv2d_sm80_f16_s16816wgrad_optimized_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_h16816dgrad_optimized.a [ 80%] Built target cutlass_library_conv2d_sm80_h16816dgrad_optimized_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_h16816fprop_fixed_channels.a [ 80%] Built target cutlass_library_conv2d_sm80_h16816fprop_fixed_channels_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_h16816fprop_optimized.a [ 80%] Built target cutlass_library_conv2d_sm80_h16816fprop_optimized_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_h16816wgrad_optimized.a [ 80%] Built target cutlass_library_conv2d_sm80_h16816wgrad_optimized_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_i16832fprop_optimized_s8.a [ 80%] Built target cutlass_library_conv2d_sm80_i16832fprop_optimized_s8_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_i16832fprop_optimized_u8.a [ 80%] Built target cutlass_library_conv2d_sm80_i16832fprop_optimized_u8_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_i16864fprop_optimized_s4.a [ 80%] Built target cutlass_library_conv2d_sm80_i16864fprop_optimized_s4_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_i16864fprop_optimized_u4.a [ 80%] Built target cutlass_library_conv2d_sm80_i16864fprop_optimized_u4_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_s16816dgrad_optimized_bf16.a [ 80%] Built target cutlass_library_conv2d_sm80_s16816dgrad_optimized_bf16_static [ 80%] Linking CUDA 
static library libcutlass_conv2d_sm80_s16816dgrad_optimized_f16.a [ 80%] Built target cutlass_library_conv2d_sm80_s16816dgrad_optimized_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_s16816fprop_fixed_channels_bf16.a [ 80%] Built target cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_bf16_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s16816fprop_fixed_channels_f16.a [ 81%] Built target cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_f16_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s16816fprop_optimized_bf16.a [ 81%] Built target cutlass_library_conv2d_sm80_s16816fprop_optimized_bf16_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s16816fprop_optimized_f16.a [ 81%] Built target cutlass_library_conv2d_sm80_s16816fprop_optimized_f16_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s16816wgrad_optimized_bf16.a [ 81%] Built target cutlass_library_conv2d_sm80_s16816wgrad_optimized_bf16_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s16816wgrad_optimized_f16.a [ 81%] Built target cutlass_library_conv2d_sm80_s16816wgrad_optimized_f16_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688bf16dgrad_optimized.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688bf16dgrad_optimized_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688bf16fprop_optimized.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688bf16fprop_optimized_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688bf16wgrad_optimized.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688bf16wgrad_optimized_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688dgrad_optimized.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688dgrad_optimized_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688dgrad_optimized_tf32.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688dgrad_optimized_tf32_static [ 
81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688f16dgrad_optimized.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688f16dgrad_optimized_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688f16wgrad_optimized.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688f16wgrad_optimized_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688fprop_optimized.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688fprop_optimized_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688fprop_optimized_tf32.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688fprop_optimized_tf32_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688tf32dgrad_optimized.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688tf32dgrad_optimized_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688tf32fprop_optimized.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688tf32fprop_optimized_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688wgrad_optimized.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688wgrad_optimized_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688wgrad_optimized_tf32.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688wgrad_optimized_tf32_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s4_i16864fprop_optimized_s4.a [ 81%] Built target cutlass_library_conv2d_sm80_s4_i16864fprop_optimized_s4_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s8_i16832fprop_few_channels_s8.a [ 81%] Built target cutlass_library_conv2d_sm80_s8_i16832fprop_few_channels_s8_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s8_i16832fprop_fixed_channels_s8.a [ 81%] Built target cutlass_library_conv2d_sm80_s8_i16832fprop_fixed_channels_s8_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_s8_i16832fprop_optimized_s8.a [ 82%] Built target 
cutlass_library_conv2d_sm80_s8_i16832fprop_optimized_s8_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_sdgrad_optimized.a [ 82%] Built target cutlass_library_conv2d_sm80_sdgrad_optimized_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_sfprop_optimized.a [ 82%] Built target cutlass_library_conv2d_sm80_sfprop_optimized_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_swgrad_optimized.a [ 82%] Built target cutlass_library_conv2d_sm80_swgrad_optimized_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_tf32_s1688dgrad_optimized_tf32.a [ 82%] Built target cutlass_library_conv2d_sm80_tf32_s1688dgrad_optimized_tf32_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_tf32_s1688fprop_optimized_tf32.a [ 82%] Built target cutlass_library_conv2d_sm80_tf32_s1688fprop_optimized_tf32_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_tf32_s1688wgrad_optimized_tf32.a [ 82%] Built target cutlass_library_conv2d_sm80_tf32_s1688wgrad_optimized_tf32_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_u4_i16864fprop_optimized_u4.a [ 82%] Built target cutlass_library_conv2d_sm80_u4_i16864fprop_optimized_u4_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_u8_i16832fprop_few_channels_u8.a [ 82%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_few_channels_u8_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_u8_i16832fprop_fixed_channels_u8.a [ 82%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_fixed_channels_u8_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_u8_i16832fprop_optimized_u8.a [ 82%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_optimized_u8_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e4m3.a [ 82%] Built target cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e4m3_static [ 82%] Linking CUDA static library 
libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e5m2.a [ 82%] Built target cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e5m2_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm89_s16832fprop_optimized_e4m3.a [ 82%] Built target cutlass_library_conv2d_sm89_s16832fprop_optimized_e4m3_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm89_s16832fprop_optimized_e5m2.a [ 82%] Built target cutlass_library_conv2d_sm89_s16832fprop_optimized_e5m2_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16.a [ 82%] Built target cutlass_library_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16.a [ 82%] Built target cutlass_library_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32.a [ 82%] Built target cutlass_library_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32.a [ 82%] Built target cutlass_library_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32.a [ 82%] Built target cutlass_library_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32.a [ 82%] Built target cutlass_library_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16.a [ 82%] Built target cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16.a [ 82%] Built target 
cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16.a [ 82%] Built target cutlass_library_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16.a [ 82%] Built target cutlass_library_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_f16_s16816dgrad3d_analytic_f16.a [ 82%] Built target cutlass_library_conv3d_sm80_f16_s16816dgrad3d_analytic_f16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_f16_s16816dgrad3d_optimized_f16.a [ 82%] Built target cutlass_library_conv3d_sm80_f16_s16816dgrad3d_optimized_f16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_f16_s16816fprop3d_optimized_f16.a [ 82%] Built target cutlass_library_conv3d_sm80_f16_s16816fprop3d_optimized_f16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_f16_s16816wgrad3d_optimized_f16.a [ 82%] Built target cutlass_library_conv3d_sm80_f16_s16816wgrad3d_optimized_f16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_h16816dgrad3d_analytic.a [ 82%] Built target cutlass_library_conv3d_sm80_h16816dgrad3d_analytic_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_h16816dgrad3d_optimized.a [ 82%] Built target cutlass_library_conv3d_sm80_h16816dgrad3d_optimized_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_h16816fprop3d_optimized.a [ 82%] Built target cutlass_library_conv3d_sm80_h16816fprop3d_optimized_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_h16816wgrad3d_optimized.a [ 82%] Built target cutlass_library_conv3d_sm80_h16816wgrad3d_optimized_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_s16816dgrad3d_analytic_bf16.a [ 82%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_bf16_static [ 82%] Linking CUDA static 
library libcutlass_conv3d_sm80_s16816dgrad3d_analytic_f16.a [ 82%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_f16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_s16816dgrad3d_optimized_bf16.a [ 82%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_bf16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_s16816dgrad3d_optimized_f16.a [ 82%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_f16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_s16816fprop3d_optimized_bf16.a [ 82%] Built target cutlass_library_conv3d_sm80_s16816fprop3d_optimized_bf16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_s16816fprop3d_optimized_f16.a [ 82%] Built target cutlass_library_conv3d_sm80_s16816fprop3d_optimized_f16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_s16816wgrad3d_optimized_bf16.a [ 82%] Built target cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_bf16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_s16816wgrad3d_optimized_f16.a [ 82%] Built target cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_f16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16.a [ 82%] Built target cutlass_library_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16.a [ 82%] Built target cutlass_library_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32.a [ 82%] Built target cutlass_library_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32.a [ 82%] Built target cutlass_library_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_static [ 
82%] Linking CUDA static library libcutlass_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32.a [ 82%] Built target cutlass_library_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32.a [ 82%] Built target cutlass_library_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_static [ 82%] Linking CUDA static library libcutlass_rank_k_sm80_c1688herk.a [ 82%] Built target cutlass_library_rank_k_sm80_c1688herk_static [ 82%] Linking CUDA static library libcutlass_rank_k_sm80_c1688syrk.a [ 82%] Built target cutlass_library_rank_k_sm80_c1688syrk_static [ 82%] Linking CUDA static library libcutlass_rank_k_sm80_c1688tf32herk.a [ 82%] Built target cutlass_library_rank_k_sm80_c1688tf32herk_static [ 82%] Linking CUDA static library libcutlass_rank_k_sm80_c1688tf32syrk.a [ 82%] Built target cutlass_library_rank_k_sm80_c1688tf32syrk_static [ 82%] Linking CUDA static library libcutlass_rank_k_sm80_d884syrk.a [ 82%] Built target cutlass_library_rank_k_sm80_d884syrk_static [ 82%] Linking CUDA static library libcutlass_rank_k_sm80_gz884herk.a [ 82%] Built target cutlass_library_rank_k_sm80_gz884herk_static [ 82%] Linking CUDA static library libcutlass_rank_k_sm80_gz884syrk.a [ 82%] Built target cutlass_library_rank_k_sm80_gz884syrk_static [ 82%] Linking CUDA static library libcutlass_rank_k_sm80_s1688syrk.a [ 82%] Built target cutlass_library_rank_k_sm80_s1688syrk_static [ 83%] Linking CUDA static library libcutlass_rank_k_sm80_s1688tf32syrk.a [ 83%] Built target cutlass_library_rank_k_sm80_s1688tf32syrk_static [ 83%] Linking CUDA static library libcutlass_rank_k_sm80_z884herk.a [ 83%] Built target cutlass_library_rank_k_sm80_z884herk_static [ 83%] Linking CUDA static library libcutlass_rank_k_sm80_z884syrk.a [ 83%] Built target cutlass_library_rank_k_sm80_z884syrk_static [ 83%] Linking CUDA static library libcutlass_rank_k_sm90_d1684syrk.a [ 83%] Built 
target cutlass_library_rank_k_sm90_d1684syrk_static [ 83%] Linking CUDA static library libcutlass_rank_k_sm90_gz1684herk.a [ 83%] Built target cutlass_library_rank_k_sm90_gz1684herk_static [ 83%] Linking CUDA static library libcutlass_rank_k_sm90_gz1684syrk.a [ 83%] Built target cutlass_library_rank_k_sm90_gz1684syrk_static [ 83%] Linking CUDA static library libcutlass_rank_k_sm90_z1684herk.a [ 83%] Built target cutlass_library_rank_k_sm90_z1684herk_static [ 83%] Linking CUDA static library libcutlass_rank_k_sm90_z1684syrk.a [ 83%] Built target cutlass_library_rank_k_sm90_z1684syrk_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_c1688her2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_c1688her2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_c1688syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_c1688syr2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_c1688tf32her2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_c1688tf32her2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_c1688tf32syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_c1688tf32syr2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_d884syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_d884syr2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_gz884her2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_gz884her2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_gz884syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_gz884syr2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_s1688syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_s1688syr2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_s1688tf32syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_s1688tf32syr2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_z884her2k.a [ 83%] Built target 
cutlass_library_rank_2k_sm80_z884her2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_z884syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_z884syr2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm90_d1684syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm90_d1684syr2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm90_gz1684her2k.a [ 83%] Built target cutlass_library_rank_2k_sm90_gz1684her2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm90_gz1684syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm90_gz1684syr2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm90_z1684her2k.a [ 83%] Built target cutlass_library_rank_2k_sm90_z1684her2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm90_z1684syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm90_z1684syr2k_static [ 83%] Linking CUDA static library libcutlass_trmm_sm80_c1688tf32trmm.a [ 83%] Built target cutlass_library_trmm_sm80_c1688tf32trmm_static [ 83%] Linking CUDA static library libcutlass_trmm_sm80_c1688trmm.a [ 83%] Built target cutlass_library_trmm_sm80_c1688trmm_static [ 83%] Linking CUDA static library libcutlass_trmm_sm80_d884trmm.a [ 83%] Built target cutlass_library_trmm_sm80_d884trmm_static [ 83%] Linking CUDA static library libcutlass_trmm_sm80_gz884trmm.a [ 83%] Built target cutlass_library_trmm_sm80_gz884trmm_static [ 83%] Linking CUDA static library libcutlass_trmm_sm80_s1688tf32trmm.a [ 83%] Built target cutlass_library_trmm_sm80_s1688tf32trmm_static [ 83%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684hemm_objs.dir/generated/symm/90/z1684hemm/cutlass_tensorop_z1684hemm_128x64x8_1x1x1_3_n_rs_u_align1.cu.o [ 83%] Linking CUDA static library libcutlass_trmm_sm80_s1688trmm.a [ 83%] Built target cutlass_library_trmm_sm80_s1688trmm_static [ 83%] Linking CUDA static library libcutlass_trmm_sm80_z884trmm.a [ 83%] Built target cutlass_library_trmm_sm80_z884trmm_static 
[ 83%] Linking CUDA static library libcutlass_trmm_sm90_d1684trmm.a [ 83%] Built target cutlass_library_trmm_sm90_d1684trmm_static [ 83%] Linking CUDA static library libcutlass_trmm_sm90_gz1684trmm.a [ 83%] Built target cutlass_library_trmm_sm90_gz1684trmm_static [ 83%] Linking CUDA static library libcutlass_trmm_sm90_z1684trmm.a [ 83%] Built target cutlass_library_trmm_sm90_z1684trmm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_c1688hemm.a [ 83%] Built target cutlass_library_symm_sm80_c1688hemm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_c1688symm.a [ 83%] Built target cutlass_library_symm_sm80_c1688symm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_c1688tf32hemm.a [ 83%] Built target cutlass_library_symm_sm80_c1688tf32hemm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_c1688tf32symm.a [ 83%] Built target cutlass_library_symm_sm80_c1688tf32symm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_d884symm.a [ 83%] Built target cutlass_library_symm_sm80_d884symm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_gz884hemm.a [ 83%] Built target cutlass_library_symm_sm80_gz884hemm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_gz884symm.a [ 83%] Built target cutlass_library_symm_sm80_gz884symm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_s1688symm.a [ 83%] Built target cutlass_library_symm_sm80_s1688symm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_s1688tf32symm.a [ 83%] Built target cutlass_library_symm_sm80_s1688tf32symm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_z884hemm.a [ 83%] Built target cutlass_library_symm_sm80_z884hemm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_z884symm.a [ 83%] Built target cutlass_library_symm_sm80_z884symm_static [ 83%] Linking CUDA static library libcutlass_symm_sm90_d1684symm.a [ 83%] Linking CUDA static library libcutlass_symm_sm90_gz1684hemm.a [ 83%] 
Built target cutlass_library_symm_sm90_d1684symm_static [ 83%] Built target cutlass_library_symm_sm90_gz1684hemm_static [ 83%] Linking CUDA static library libcutlass_symm_sm90_gz1684symm.a [ 83%] Linking CUDA shared library libcutlass_symm_sm90_z1684symm.so [ 83%] Built target cutlass_library_symm_sm90_gz1684symm_static [ 83%] Linking CUDA shared library libcutlass_gemm_sm50_cgemm.so [ 83%] Built target cutlass_library_symm_sm90_z1684symm [ 83%] Built target cutlass_library_gemm_sm50_cgemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm50_dgemm.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm50_sgemm.so [ 83%] Built target cutlass_library_gemm_sm50_dgemm [ 83%] Built target cutlass_library_gemm_sm50_sgemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm60_hgemm.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm61_igemm_s8.so [ 83%] Built target cutlass_library_gemm_sm61_igemm_s8 [ 83%] Built target cutlass_library_gemm_sm60_hgemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm61_s8_igemm_s8.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm70_f16_s884gemm_f16.so [ 83%] Built target cutlass_library_gemm_sm70_f16_s884gemm_f16 [ 83%] Built target cutlass_library_gemm_sm61_s8_igemm_s8 [ 83%] Linking CUDA shared library libcutlass_gemm_sm70_f16_s884gemm_planar_complex_array_f16.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm70_f16_s884gemm_planar_complex_f16.so [ 83%] Built target cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16 [ 83%] Built target cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm70_h884gemm.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm70_h884gemm_planar_complex.so [ 83%] Built target cutlass_library_gemm_sm70_h884gemm [ 83%] Built target cutlass_library_gemm_sm70_h884gemm_planar_complex [ 83%] Linking CUDA shared library libcutlass_gemm_sm70_h884gemm_planar_complex_array.so [ 83%] Linking CUDA shared library 
libcutlass_gemm_sm70_s884gemm_f16.so [ 83%] Built target cutlass_library_gemm_sm70_h884gemm_planar_complex_array [ 83%] Built target cutlass_library_gemm_sm70_s884gemm_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm70_s884gemm_planar_complex_array_f16.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm70_s884gemm_planar_complex_f16.so [ 83%] Built target cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16 [ 83%] Built target cutlass_library_gemm_sm70_s884gemm_planar_complex_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_f16_s1688gemm_f16.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_array_f16.so [ 83%] Built target cutlass_library_gemm_sm75_f16_s1688gemm_f16 [ 83%] Built target cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_f16.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_h1688gemm.so [ 83%] Built target cutlass_library_gemm_sm75_h1688gemm [ 83%] Built target cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_h1688gemm_planar_complex.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_h1688gemm_planar_complex_array.so [ 83%] Built target cutlass_library_gemm_sm75_h1688gemm_planar_complex [ 83%] Built target cutlass_library_gemm_sm75_h1688gemm_planar_complex_array [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_i88128xorgemm_b1.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_i8816gemm_s8.so [ 83%] Built target cutlass_library_gemm_sm75_i8816gemm_s8 [ 83%] Built target cutlass_library_gemm_sm75_i88128xorgemm_b1 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_i8832gemm_s4.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_i8816gemm_u8.so [ 83%] Built target cutlass_library_gemm_sm75_i8816gemm_u8 [ 83%] Built target cutlass_library_gemm_sm75_i8832gemm_s4 [ 83%] Linking CUDA 
shared library libcutlass_gemm_sm75_i8832gemm_u4.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_s1688gemm_f16.so [ 83%] Built target cutlass_library_gemm_sm75_i8832gemm_u4 [ 83%] Built target cutlass_library_gemm_sm75_s1688gemm_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_s1688gemm_planar_complex_array_f16.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_s1688gemm_planar_complex_f16.so [ 83%] Built target cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16 [ 83%] Built target cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_s4_i8832gemm_s4.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_s8_i8816gemm_s8.so [ 83%] Built target cutlass_library_gemm_sm75_s4_i8832gemm_s4 [ 83%] Built target cutlass_library_gemm_sm75_s8_i8816gemm_s8 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_u8_i8816gemm_u8.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_u4_i8832gemm_u4.so [ 83%] Built target cutlass_library_gemm_sm75_u4_i8832gemm_u4 [ 83%] Built target cutlass_library_gemm_sm75_u8_i8816gemm_u8 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_bf16_s16816gemm_bf16_s8.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_bf16_s16816gemm_bf16.so [ 83%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_s8 [ 83%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_bf16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_bf16_s16816gemm_bf16_u8.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16.so [ 83%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_u8 [ 83%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_bf16.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_bf16_s16816gemm_s8_bf16.so [ 83%] Built target 
cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16 [ 83%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_s8_bf16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_bf16_s16816gemm_u8_bf16.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_bf16_s16832spgemm_bf16.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_c1688gemm.so [ 83%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_u8_bf16 [ 83%] Built target cutlass_library_gemm_sm80_bf16_s16832spgemm_bf16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_c1688tf32gemm.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_cgemm.so [ 83%] Built target cutlass_library_gemm_sm80_c1688gemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_d884gemm.so [ 83%] Built target cutlass_library_gemm_sm80_c1688tf32gemm [ 83%] Built target cutlass_library_gemm_sm80_cgemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_dgemm.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_f16_s16816gemm_f16.so [ 83%] Built target cutlass_library_gemm_sm80_d884gemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_f16_s16816gemm_f16_s8.so [ 83%] Built target cutlass_library_gemm_sm80_dgemm [ 83%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_f16_s16816gemm_f16_u8.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_array_f16.so [ 83%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_f16_s8 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_f16.so [ 83%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_f16_u8 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_f16_s16816gemm_s8_f16.so [ 83%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_f16_s16816gemm_u8_f16.so [ 83%] Built target 
cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16 [ 83%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_s8_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_f16_s16832spgemm_f16.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_gz884gemm.so [ 83%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_u8_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_h16816gemm.so [ 83%] Built target cutlass_library_gemm_sm80_f16_s16832spgemm_f16 [ 83%] Built target cutlass_library_gemm_sm80_gz884gemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_h16816gemm_grouped.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_h16816gemm_planar_complex.so [ 83%] Built target cutlass_library_gemm_sm80_h16816gemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_h16816gemm_planar_complex_array.so [ 83%] Built target cutlass_library_gemm_sm80_h16816gemm_grouped [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_h16816gemm_s8_f16.so [ 83%] Built target cutlass_library_gemm_sm80_h16816gemm_planar_complex [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_h16832spgemm.so [ 83%] Built target cutlass_library_gemm_sm80_h16816gemm_planar_complex_array [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_i168128spgemm_s4.so [ 83%] Built target cutlass_library_gemm_sm80_h16816gemm_s8_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_i168256andgemm_b1.so [ 83%] Built target cutlass_library_gemm_sm80_h16832spgemm [ 83%] Built target cutlass_library_gemm_sm80_i168128spgemm_s4 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_i168256xorgemm_b1.so [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_i16832gemm_s8.so [ 83%] Built target cutlass_library_gemm_sm80_i168256andgemm_b1 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_i16832gemm_u8.so [ 83%] Built target cutlass_library_gemm_sm80_i168256xorgemm_b1 [ 83%] Built target cutlass_library_gemm_sm80_i16832gemm_s8 [ 84%] Linking CUDA shared library 
libcutlass_gemm_sm80_i16864gemm_s4.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_i16864gemm_u4.so [ 84%] Built target cutlass_library_gemm_sm80_i16832gemm_u8 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_i16864spgemm_s8.so [ 84%] Built target cutlass_library_gemm_sm80_i16864gemm_s4 [ 84%] Built target cutlass_library_gemm_sm80_i16864gemm_u4 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_bf16.so [ 84%] Built target cutlass_library_gemm_sm80_i16864spgemm_s8 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_bf16_s8.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_bf16_u8.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_bf16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_f16.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_bf16_u8 [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_bf16_s8 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_f16_s8.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_f16_u8.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_f16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_grouped_bf16.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_f16_s8 [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_f16_u8 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_grouped_f16.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_planar_complex_array_bf16.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_grouped_bf16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_planar_complex_array_f16.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_grouped_f16 [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_planar_complex_bf16.so [ 84%] Linking CUDA shared library 
libcutlass_gemm_sm80_s16816gemm_planar_complex_f16.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_s8_bf16.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16 [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_s8_f16.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_u8_bf16.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_s8_bf16 [ 84%] Linking CUDA shared library libcutlass_conv2d_sm75_cf32_cfprop_optimized_cf32.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_s8_f16 [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_u8_bf16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_u8_f16.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816tf32spgemm.so [ 84%] Built target cutlass_library_conv2d_sm75_cf32_cfprop_optimized_cf32 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16832spgemm_bf16.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_u8_f16 [ 84%] Built target cutlass_library_gemm_sm80_s16816tf32spgemm [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16832spgemm_f16.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s1688bf16gemm.so [ 84%] Built target cutlass_library_gemm_sm80_s16832spgemm_bf16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s1688f16gemm.so [ 84%] Built target cutlass_library_gemm_sm80_s16832spgemm_f16 [ 84%] Built target cutlass_library_gemm_sm80_s1688bf16gemm [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s1688gemm.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s1688gemm_tf32.so [ 84%] Built target cutlass_library_gemm_sm80_s1688f16gemm [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s1688tf32gemm.so [ 84%] Built target cutlass_library_gemm_sm80_s1688gemm [ 84%] Built target 
cutlass_library_gemm_sm80_s1688gemm_tf32 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s4_i168128spgemm_s4.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s4_i16864gemm_s4.so [ 84%] Built target cutlass_library_gemm_sm80_s1688tf32gemm [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s8_i16832gemm_s8.so [ 84%] Built target cutlass_library_gemm_sm80_s4_i168128spgemm_s4 [ 84%] Built target cutlass_library_gemm_sm80_s4_i16864gemm_s4 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s8_i16864spgemm_s8.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_sgemm.so [ 84%] Built target cutlass_library_gemm_sm80_s8_i16832gemm_s8 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_tf32_s1688gemm_tf32.so [ 84%] Built target cutlass_library_gemm_sm80_s8_i16864spgemm_s8 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_u4_i16864gemm_u4.so [ 84%] Built target cutlass_library_gemm_sm80_sgemm [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_u8_i16832gemm_u8.so [ 84%] Built target cutlass_library_gemm_sm80_tf32_s1688gemm_tf32 [ 84%] Built target cutlass_library_gemm_sm80_u4_i16864gemm_u4 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_z884gemm.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3.so [ 84%] Built target cutlass_library_gemm_sm80_u8_i16832gemm_u8 [ 84%] Linking CUDA shared library libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2.so [ 84%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3 [ 84%] Built target cutlass_library_gemm_sm80_z884gemm [ 84%] Linking CUDA shared library libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3.so [ 84%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2 [ 84%] Linking CUDA shared library libcutlass_gemm_sm89_s16832gemm_e4m3.so [ 84%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3 [ 84%] Built target 
cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2 [ 84%] Linking CUDA shared library libcutlass_gemm_sm89_s16832gemm_e4m3_e5m2.so [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16832gemm_e5m2.so [ 85%] Built target cutlass_library_gemm_sm89_s16832gemm_e4m3 [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16832gemm_e5m2_e4m3.so [ 85%] Built target cutlass_library_gemm_sm89_s16832gemm_e4m3_e5m2 [ 85%] Built target cutlass_library_gemm_sm89_s16832gemm_e5m2 [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3.so [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2.so [ 85%] Built target cutlass_library_gemm_sm89_s16832gemm_e5m2_e4m3 [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2.so [ 85%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3 [ 85%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2 [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3.so [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16864spgemm_e4m3.so [ 85%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2 [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16864spgemm_e4m3_e5m2.so [ 85%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3 [ 85%] Built target cutlass_library_gemm_sm89_s16864spgemm_e4m3 [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16864spgemm_e5m2.so [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16864spgemm_e5m2_e4m3.so [ 85%] Built target cutlass_library_gemm_sm89_s16864spgemm_e4m3_e5m2 [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_bf16_s64x128x16gemm_bf16.so [ 85%] Built target cutlass_library_gemm_sm89_s16864spgemm_e5m2 [ 85%] Built target cutlass_library_gemm_sm89_s16864spgemm_e5m2_e4m3 [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3.so [ 85%] Linking CUDA shared library 
libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2.so [ 85%] Built target cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16 [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2.so [ 85%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3 [ 85%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2 [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_d1684gemm.so [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3.so [ 85%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2 [ 85%] Built target cutlass_library_gemm_sm90_d1684gemm [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_f16_s64x128x16gemm_f16.so [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3.so [ 85%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3 [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2.so [ 85%] Built target cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16 [ 85%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3 [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2.so [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3.so [ 85%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2 [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_gz1684gemm.so [ 85%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2 [ 85%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3 [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_h64x128x16gemm.so [ 85%] Built target cutlass_library_gemm_sm90_gz1684gemm [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_i64x128x32gemm_s8.so [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_i64x128x32gemm_u8.so [ 86%] Built target cutlass_library_gemm_sm90_i64x128x32gemm_s8 [ 86%] Built target cutlass_library_gemm_sm90_i64x128x32gemm_u8 [ 86%] Built 
target cutlass_library_gemm_sm90_h64x128x16gemm [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s64x128x16gemm_bf16.so [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s64x128x16gemm_f16.so [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s64x128x32gemm_e4m3.so [ 86%] Built target cutlass_library_gemm_sm90_s64x128x16gemm_bf16 [ 86%] Built target cutlass_library_gemm_sm90_s64x128x16gemm_f16 [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s64x128x32gemm_e4m3_e5m2.so [ 86%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e4m3 [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s64x128x32gemm_e5m2.so [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s64x128x32gemm_e5m2_e4m3.so [ 86%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2 [ 86%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e5m2 [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s64x128x8gemm_tf32.so [ 86%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3 [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s64x128x8tf32gemm.so [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s8_i64x128x32gemm_s8.so [ 86%] Built target cutlass_library_gemm_sm90_s64x128x8gemm_tf32 [ 86%] Built target cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8 [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s8_i64x128x32gemm_u8.so [ 86%] Built target cutlass_library_gemm_sm90_s64x128x8tf32gemm [ 87%] Linking CUDA shared library libcutlass_gemm_sm90_void_i64x128x32gemm_s8.so [ 87%] Linking CUDA shared library libcutlass_gemm_sm90_void_i64x128x32gemm_u8.so [ 87%] Built target cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8 [ 87%] Built target cutlass_library_gemm_sm90_void_i64x128x32gemm_s8 [ 87%] Built target cutlass_library_gemm_sm90_void_i64x128x32gemm_u8 [ 87%] Linking CUDA shared library libcutlass_gemm_sm90_void_s64x128x16gemm_bf16.so [ 87%] Linking CUDA shared library libcutlass_gemm_sm90_void_s64x128x16gemm_f16.so [ 88%] Linking 
CUDA shared library libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3.so [ 88%] Built target cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16 [ 88%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3 [ 88%] Built target cutlass_library_gemm_sm90_void_s64x128x16gemm_f16 [ 88%] Linking CUDA shared library libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2.so [ 88%] Linking CUDA shared library libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2.so [ 88%] Linking CUDA shared library libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3.so [ 88%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2 [ 88%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2 [ 88%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3 [ 88%] Linking CUDA shared library libcutlass_conv2d_sm50_cf32_cfprop_optimized_cf32.so [ 88%] Linking CUDA shared library libcutlass_conv2d_sm50_cf32_cdgrad_optimized_cf32.so [ 88%] Linking CUDA shared library libcutlass_gemm_sm90_z1684gemm.so [ 88%] Built target cutlass_library_conv2d_sm50_cf32_cfprop_optimized_cf32 [ 88%] Built target cutlass_library_conv2d_sm50_cf32_cdgrad_optimized_cf32 [ 88%] Built target cutlass_library_gemm_sm90_z1684gemm [ 88%] Linking CUDA shared library libcutlass_conv2d_sm50_cf32_cwgrad_optimized_cf32.so [ 88%] Linking CUDA shared library libcutlass_conv2d_sm50_sdgrad_optimized.so [ 88%] Linking CUDA shared library libcutlass_conv2d_sm50_sfprop_optimized.so [ 88%] Built target cutlass_library_conv2d_sm50_sdgrad_optimized [ 88%] Built target cutlass_library_conv2d_sm50_cf32_cwgrad_optimized_cf32 [ 88%] Built target cutlass_library_conv2d_sm50_sfprop_optimized [ 88%] Linking CUDA shared library libcutlass_conv2d_sm50_swgrad_optimized.so [ 88%] Linking CUDA shared library libcutlass_conv2d_sm60_hfprop_optimized.so [ 88%] Linking CUDA shared library libcutlass_conv2d_sm70_f16_s884dgrad_optimized_f16.so [ 88%] Built target cutlass_library_conv2d_sm60_hfprop_optimized [ 88%] Built target 
cutlass_library_conv2d_sm50_swgrad_optimized [ 89%] Built target cutlass_library_conv2d_sm70_f16_s884dgrad_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm70_f16_s884fprop_optimized_f16.so [ 89%] Linking CUDA shared library libcutlass_conv2d_sm70_f16_s884wgrad_optimized_f16.so [ 89%] Linking CUDA shared library libcutlass_conv2d_sm70_h884dgrad_optimized.so [ 89%] Built target cutlass_library_conv2d_sm70_f16_s884wgrad_optimized_f16 [ 89%] Built target cutlass_library_conv2d_sm70_f16_s884fprop_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm70_h884wgrad_optimized.so [ 89%] Linking CUDA shared library libcutlass_conv2d_sm70_h884fprop_optimized.so [ 89%] Built target cutlass_library_conv2d_sm70_h884dgrad_optimized [ 89%] Linking CUDA shared library libcutlass_conv2d_sm70_s884dgrad_optimized_f16.so [ 89%] Built target cutlass_library_conv2d_sm70_h884fprop_optimized [ 89%] Built target cutlass_library_conv2d_sm70_h884wgrad_optimized [ 89%] Linking CUDA shared library libcutlass_conv2d_sm70_s884wgrad_optimized_f16.so [ 89%] Linking CUDA shared library libcutlass_conv2d_sm70_s884fprop_optimized_f16.so [ 89%] Built target cutlass_library_conv2d_sm70_s884dgrad_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_cf32_cdgrad_optimized_cf32.so [ 89%] Built target cutlass_library_conv2d_sm70_s884fprop_optimized_f16 [ 89%] Built target cutlass_library_conv2d_sm70_s884wgrad_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_cf32_cwgrad_optimized_cf32.so [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_f16_s1688dgrad_optimized_f16.so [ 89%] Built target cutlass_library_conv2d_sm75_cf32_cdgrad_optimized_cf32 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_f16_s1688fprop_few_channels_f16.so [ 89%] Built target cutlass_library_conv2d_sm75_cf32_cwgrad_optimized_cf32 [ 89%] Built target cutlass_library_conv2d_sm75_f16_s1688dgrad_optimized_f16 [ 89%] Linking CUDA shared library 
libcutlass_conv2d_sm75_f16_s1688fprop_fixed_channels_f16.so [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_f16_s1688fprop_optimized_f16.so [ 89%] Built target cutlass_library_conv2d_sm75_f16_s1688fprop_few_channels_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_f16_s1688wgrad_optimized_f16.so [ 89%] Built target cutlass_library_conv2d_sm75_f16_s1688fprop_fixed_channels_f16 [ 89%] Built target cutlass_library_conv2d_sm75_f16_s1688fprop_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_h1688dgrad_optimized.so [ 89%] Built target cutlass_library_conv2d_sm75_f16_s1688wgrad_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_h1688fprop_few_channels.so [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_h1688fprop_fixed_channels.so [ 89%] Built target cutlass_library_conv2d_sm75_h1688fprop_few_channels [ 89%] Built target cutlass_library_conv2d_sm75_h1688dgrad_optimized [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_h1688fprop_optimized.so [ 89%] Built target cutlass_library_conv2d_sm75_h1688fprop_fixed_channels [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_h1688wgrad_optimized.so [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_i8816fprop_optimized_s8.so [ 89%] Built target cutlass_library_conv2d_sm75_h1688wgrad_optimized [ 89%] Built target cutlass_library_conv2d_sm75_h1688fprop_optimized [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_i8816fprop_optimized_u8.so [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_i8832fprop_optimized_s4.so [ 89%] Built target cutlass_library_conv2d_sm75_i8816fprop_optimized_s8 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_i8832fprop_optimized_u4.so [ 89%] Built target cutlass_library_conv2d_sm75_i8832fprop_optimized_s4 [ 89%] Built target cutlass_library_conv2d_sm75_i8816fprop_optimized_u8 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_s1688fprop_few_channels_f16.so [ 89%] Linking CUDA shared 
library libcutlass_conv2d_sm75_s1688dgrad_optimized_f16.so [ 89%] Built target cutlass_library_conv2d_sm75_i8832fprop_optimized_u4 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_s1688fprop_fixed_channels_f16.so [ 89%] Built target cutlass_library_conv2d_sm75_s1688fprop_few_channels_f16 [ 89%] Built target cutlass_library_conv2d_sm75_s1688dgrad_optimized_f16 [ 89%] Built target cutlass_library_conv2d_sm75_s1688fprop_fixed_channels_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_s1688fprop_optimized_f16.so [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_s1688wgrad_optimized_f16.so [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_s4_i8832fprop_optimized_s4.so [ 89%] Built target cutlass_library_conv2d_sm75_s1688fprop_optimized_f16 [ 89%] Built target cutlass_library_conv2d_sm75_s1688wgrad_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_s8_i8816fprop_few_channels_s8.so [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_s8_i8816fprop_fixed_channels_s8.so [ 89%] Built target cutlass_library_conv2d_sm75_s4_i8832fprop_optimized_s4 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_s8_i8816fprop_optimized_s8.so [ 89%] Built target cutlass_library_conv2d_sm75_s8_i8816fprop_few_channels_s8 [ 89%] Built target cutlass_library_conv2d_sm75_s8_i8816fprop_fixed_channels_s8 [ 90%] Linking CUDA shared library libcutlass_conv2d_sm75_u4_i8832fprop_optimized_u4.so [ 90%] Linking CUDA shared library libcutlass_conv2d_sm75_u8_i8816fprop_few_channels_u8.so [ 90%] Built target cutlass_library_conv2d_sm75_s8_i8816fprop_optimized_s8 [ 90%] Linking CUDA shared library libcutlass_conv2d_sm75_u8_i8816fprop_fixed_channels_u8.so [ 90%] Built target cutlass_library_conv2d_sm75_u4_i8832fprop_optimized_u4 [ 90%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_few_channels_u8 [ 90%] Linking CUDA shared library libcutlass_conv2d_sm75_u8_i8816fprop_optimized_u8.so [ 90%] Linking CUDA shared library 
libcutlass_conv2d_sm80_bf16_s16816dgrad_optimized_bf16.so [ 90%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_fixed_channels_u8 [ 90%] Linking CUDA shared library libcutlass_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16.so [ 90%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_optimized_u8 [ 90%] Built target cutlass_library_conv2d_sm80_bf16_s16816dgrad_optimized_bf16 [ 90%] Built target cutlass_library_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16 [ 90%] Linking CUDA shared library libcutlass_conv2d_sm80_bf16_s16816fprop_optimized_bf16.so [ 90%] Linking CUDA shared library libcutlass_conv2d_sm80_bf16_s16816wgrad_optimized_bf16.so [ 91%] Linking CUDA shared library libcutlass_conv2d_sm80_f16_s16816dgrad_optimized_f16.so [ 91%] Built target cutlass_library_conv2d_sm80_bf16_s16816wgrad_optimized_bf16 [ 91%] Built target cutlass_library_conv2d_sm80_bf16_s16816fprop_optimized_bf16 [ 91%] Linking CUDA shared library libcutlass_conv2d_sm80_f16_s16816fprop_optimized_f16.so [ 91%] Linking CUDA shared library libcutlass_conv2d_sm80_f16_s16816fprop_fixed_channels_f16.so [ 91%] Built target cutlass_library_conv2d_sm80_f16_s16816dgrad_optimized_f16 [ 91%] Linking CUDA shared library libcutlass_conv2d_sm80_f16_s16816wgrad_optimized_f16.so [ 91%] Built target cutlass_library_conv2d_sm80_f16_s16816fprop_fixed_channels_f16 [ 91%] Built target cutlass_library_conv2d_sm80_f16_s16816fprop_optimized_f16 [ 91%] Linking CUDA shared library libcutlass_conv2d_sm80_h16816dgrad_optimized.so [ 91%] Linking CUDA shared library libcutlass_conv2d_sm80_h16816fprop_fixed_channels.so [ 91%] Built target cutlass_library_conv2d_sm80_f16_s16816wgrad_optimized_f16 [ 91%] Linking CUDA shared library libcutlass_conv2d_sm80_h16816fprop_optimized.so [ 91%] Built target cutlass_library_conv2d_sm80_h16816fprop_fixed_channels [ 91%] Built target cutlass_library_conv2d_sm80_h16816dgrad_optimized [ 91%] Linking CUDA shared library libcutlass_conv2d_sm80_h16816wgrad_optimized.so [ 92%] 
Linking CUDA shared library libcutlass_conv2d_sm80_i16832fprop_optimized_s8.so [ 92%] Built target cutlass_library_conv2d_sm80_h16816fprop_optimized [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_i16832fprop_optimized_u8.so [ 92%] Built target cutlass_library_conv2d_sm80_i16832fprop_optimized_s8 [ 92%] Built target cutlass_library_conv2d_sm80_h16816wgrad_optimized [ 92%] Built target cutlass_library_conv2d_sm80_i16832fprop_optimized_u8 [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_i16864fprop_optimized_s4.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_i16864fprop_optimized_u4.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s16816dgrad_optimized_bf16.so [ 92%] Built target cutlass_library_conv2d_sm80_i16864fprop_optimized_s4 [ 92%] Built target cutlass_library_conv2d_sm80_i16864fprop_optimized_u4 [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s16816dgrad_optimized_f16.so [ 92%] Built target cutlass_library_conv2d_sm80_s16816dgrad_optimized_bf16 [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s16816fprop_fixed_channels_bf16.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s16816fprop_fixed_channels_f16.so [ 92%] Built target cutlass_library_conv2d_sm80_s16816dgrad_optimized_f16 [ 92%] Built target cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_bf16 [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s16816fprop_optimized_bf16.so [ 92%] Built target cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_f16 [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s16816fprop_optimized_f16.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s16816wgrad_optimized_bf16.so [ 92%] Built target cutlass_library_conv2d_sm80_s16816fprop_optimized_bf16 [ 92%] Built target cutlass_library_conv2d_sm80_s16816fprop_optimized_f16 [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s16816wgrad_optimized_f16.so [ 92%] Built target 
cutlass_library_conv2d_sm80_s16816wgrad_optimized_bf16 [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688bf16dgrad_optimized.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688bf16fprop_optimized.so [ 92%] Built target cutlass_library_conv2d_sm80_s16816wgrad_optimized_f16 [ 92%] Built target cutlass_library_conv2d_sm80_s1688bf16dgrad_optimized [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688bf16wgrad_optimized.so [ 92%] Built target cutlass_library_conv2d_sm80_s1688bf16fprop_optimized [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688dgrad_optimized.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688dgrad_optimized_tf32.so [ 92%] Built target cutlass_library_conv2d_sm80_s1688bf16wgrad_optimized [ 92%] Built target cutlass_library_conv2d_sm80_s1688dgrad_optimized [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688f16dgrad_optimized.so [ 92%] Built target cutlass_library_conv2d_sm80_s1688dgrad_optimized_tf32 [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688f16fprop_optimized.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688f16wgrad_optimized.so [ 92%] Built target cutlass_library_conv2d_sm80_s1688f16dgrad_optimized [ 92%] Built target cutlass_library_conv2d_sm80_s1688f16fprop_optimized [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688fprop_optimized_tf32.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688fprop_optimized.so [ 92%] Built target cutlass_library_conv2d_sm80_s1688f16wgrad_optimized [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688tf32dgrad_optimized.so [ 92%] Built target cutlass_library_conv2d_sm80_s1688fprop_optimized_tf32 [ 92%] Built target cutlass_library_conv2d_sm80_s1688fprop_optimized [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688tf32fprop_optimized.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688tf32wgrad_optimized.so [ 93%] Built target 
cutlass_library_conv2d_sm80_s1688tf32dgrad_optimized [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688wgrad_optimized.so [ 93%] Built target cutlass_library_conv2d_sm80_s1688tf32wgrad_optimized [ 93%] Built target cutlass_library_conv2d_sm80_s1688tf32fprop_optimized [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_s4_i16864fprop_optimized_s4.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688wgrad_optimized_tf32.so [ 93%] Built target cutlass_library_conv2d_sm80_s1688wgrad_optimized [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_s8_i16832fprop_few_channels_s8.so [ 93%] Built target cutlass_library_conv2d_sm80_s4_i16864fprop_optimized_s4 [ 93%] Built target cutlass_library_conv2d_sm80_s1688wgrad_optimized_tf32 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_s8_i16832fprop_fixed_channels_s8.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_s8_i16832fprop_optimized_s8.so [ 93%] Built target cutlass_library_conv2d_sm80_s8_i16832fprop_few_channels_s8 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_sdgrad_optimized.so [ 93%] Built target cutlass_library_conv2d_sm80_s8_i16832fprop_fixed_channels_s8 [ 93%] Built target cutlass_library_conv2d_sm80_s8_i16832fprop_optimized_s8 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_sfprop_optimized.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_swgrad_optimized.so [ 93%] Built target cutlass_library_conv2d_sm80_sdgrad_optimized [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_tf32_s1688dgrad_optimized_tf32.so [ 93%] Built target cutlass_library_conv2d_sm80_swgrad_optimized [ 93%] Built target cutlass_library_conv2d_sm80_sfprop_optimized [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_tf32_s1688fprop_optimized_tf32.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_tf32_s1688wgrad_optimized_tf32.so [ 93%] Built target cutlass_library_conv2d_sm80_tf32_s1688dgrad_optimized_tf32 [ 93%] Linking CUDA shared 
library libcutlass_conv2d_sm80_u4_i16864fprop_optimized_u4.so [ 93%] Built target cutlass_library_conv2d_sm80_tf32_s1688wgrad_optimized_tf32 [ 93%] Built target cutlass_library_conv2d_sm80_tf32_s1688fprop_optimized_tf32 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_u8_i16832fprop_fixed_channels_u8.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_u8_i16832fprop_few_channels_u8.so [ 93%] Built target cutlass_library_conv2d_sm80_u4_i16864fprop_optimized_u4 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_u8_i16832fprop_optimized_u8.so [ 93%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_fixed_channels_u8 [ 93%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_few_channels_u8 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e4m3.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e5m2.so [ 93%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_optimized_u8 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm89_s16832fprop_optimized_e4m3.so [ 93%] Built target cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e4m3 [ 93%] Built target cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e5m2 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm89_s16832fprop_optimized_e5m2.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16.so [ 93%] Built target cutlass_library_conv2d_sm89_s16832fprop_optimized_e4m3 [ 93%] Built target cutlass_library_conv2d_sm89_s16832fprop_optimized_e5m2 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16.so [ 93%] Built target cutlass_library_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32.so [ 93%] 
Built target cutlass_library_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16 [ 93%] Built target cutlass_library_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32.so [ 93%] Built target cutlass_library_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32 [ 93%] Linking CUDA shared library libcutlass_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16.so [ 93%] Built target cutlass_library_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32 [ 93%] Built target cutlass_library_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32 [ 93%] Linking CUDA shared library libcutlass_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16.so [ 93%] Linking CUDA shared library libcutlass_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16.so [ 93%] Built target cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16 [ 94%] Linking CUDA shared library libcutlass_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16.so [ 94%] Built target cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16 [ 94%] Built target cutlass_library_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16 [ 94%] Linking CUDA shared library libcutlass_conv3d_sm80_f16_s16816dgrad3d_analytic_f16.so [ 94%] Built target cutlass_library_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16 [ 94%] Linking CUDA shared library libcutlass_conv3d_sm80_f16_s16816dgrad3d_optimized_f16.so [ 94%] Linking CUDA shared library libcutlass_conv3d_sm80_f16_s16816fprop3d_optimized_f16.so [ 94%] Built target cutlass_library_conv3d_sm80_f16_s16816dgrad3d_analytic_f16 [ 94%] Built target cutlass_library_conv3d_sm80_f16_s16816dgrad3d_optimized_f16 [ 94%] Built target cutlass_library_conv3d_sm80_f16_s16816fprop3d_optimized_f16 [ 94%] Linking CUDA shared library libcutlass_conv3d_sm80_h16816dgrad3d_analytic.so [ 94%] Linking CUDA shared 
library libcutlass_conv3d_sm80_f16_s16816wgrad3d_optimized_f16.so [ 94%] Linking CUDA shared library libcutlass_conv3d_sm80_h16816dgrad3d_optimized.so [ 94%] Built target cutlass_library_conv3d_sm80_f16_s16816wgrad3d_optimized_f16 [ 94%] Built target cutlass_library_conv3d_sm80_h16816dgrad3d_analytic [ 94%] Linking CUDA shared library libcutlass_conv3d_sm80_h16816fprop3d_optimized.so [ 94%] Linking CUDA shared library libcutlass_conv3d_sm80_h16816wgrad3d_optimized.so [ 94%] Built target cutlass_library_conv3d_sm80_h16816dgrad3d_optimized [ 95%] Linking CUDA shared library libcutlass_conv3d_sm80_s16816dgrad3d_analytic_bf16.so [ 95%] Built target cutlass_library_conv3d_sm80_h16816fprop3d_optimized [ 95%] Built target cutlass_library_conv3d_sm80_h16816wgrad3d_optimized [ 95%] Linking CUDA shared library libcutlass_conv3d_sm80_s16816dgrad3d_optimized_bf16.so [ 95%] Linking CUDA shared library libcutlass_conv3d_sm80_s16816dgrad3d_analytic_f16.so [ 95%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_bf16 [ 95%] Linking CUDA shared library libcutlass_conv3d_sm80_s16816dgrad3d_optimized_f16.so [ 95%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_bf16 [ 95%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_f16 [ 95%] Linking CUDA shared library libcutlass_conv3d_sm80_s16816fprop3d_optimized_bf16.so [ 95%] Linking CUDA shared library libcutlass_conv3d_sm80_s16816fprop3d_optimized_f16.so [ 95%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_f16 [ 95%] Linking CUDA shared library libcutlass_conv3d_sm80_s16816wgrad3d_optimized_bf16.so [ 95%] Built target cutlass_library_conv3d_sm80_s16816fprop3d_optimized_bf16 [ 95%] Linking CUDA shared library libcutlass_conv3d_sm80_s16816wgrad3d_optimized_f16.so [ 95%] Built target cutlass_library_conv3d_sm80_s16816fprop3d_optimized_f16 [ 95%] Linking CUDA shared library libcutlass_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16.so [ 95%] Built target 
cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_bf16 [ 96%] Linking CUDA shared library libcutlass_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16.so [ 96%] Built target cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_f16 [ 96%] Linking CUDA shared library libcutlass_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32.so [ 96%] Built target cutlass_library_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16 [ 96%] Linking CUDA shared library libcutlass_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32.so [ 96%] Built target cutlass_library_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16 [ 96%] Linking CUDA shared library libcutlass_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32.so [ 96%] Built target cutlass_library_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32 [ 96%] Linking CUDA shared library libcutlass_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32.so [ 96%] Built target cutlass_library_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32 [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_c1688herk.so [ 96%] Built target cutlass_library_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32 [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_c1688syrk.so [ 96%] Built target cutlass_library_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32 [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_c1688tf32herk.so [ 96%] Built target cutlass_library_rank_k_sm80_c1688herk [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_c1688tf32syrk.so [ 96%] Built target cutlass_library_rank_k_sm80_c1688syrk [ 96%] Built target cutlass_library_rank_k_sm80_c1688tf32herk [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_d884syrk.so [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_gz884herk.so [ 96%] Built target cutlass_library_rank_k_sm80_c1688tf32syrk [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_gz884syrk.so [ 96%] Built target 
cutlass_library_rank_k_sm80_d884syrk [ 96%] Built target cutlass_library_rank_k_sm80_gz884herk [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_s1688syrk.so [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_s1688tf32syrk.so [ 96%] Built target cutlass_library_rank_k_sm80_gz884syrk [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_z884herk.so [ 96%] Built target cutlass_library_rank_k_sm80_s1688syrk [ 96%] Built target cutlass_library_rank_k_sm80_s1688tf32syrk [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_z884syrk.so [ 96%] Linking CUDA shared library libcutlass_rank_k_sm90_d1684syrk.so [ 96%] Built target cutlass_library_rank_k_sm80_z884herk [ 96%] Linking CUDA shared library libcutlass_rank_k_sm90_gz1684herk.so [ 96%] Built target cutlass_library_rank_k_sm80_z884syrk [ 96%] Built target cutlass_library_rank_k_sm90_d1684syrk [ 96%] Linking CUDA shared library libcutlass_rank_k_sm90_gz1684syrk.so [ 96%] Built target cutlass_library_rank_k_sm90_gz1684herk [ 97%] Linking CUDA shared library libcutlass_rank_k_sm90_z1684herk.so [ 97%] Linking CUDA shared library libcutlass_rank_k_sm90_z1684syrk.so [ 97%] Built target cutlass_library_rank_k_sm90_gz1684syrk [ 97%] Built target cutlass_library_rank_k_sm90_z1684herk [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_c1688her2k.so [ 97%] Built target cutlass_library_rank_k_sm90_z1684syrk [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_c1688syr2k.so [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_c1688tf32her2k.so [ 97%] Built target cutlass_library_rank_2k_sm80_c1688her2k [ 97%] Built target cutlass_library_rank_2k_sm80_c1688syr2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_c1688tf32syr2k.so [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_d884syr2k.so [ 97%] Built target cutlass_library_rank_2k_sm80_c1688tf32her2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_gz884her2k.so [ 97%] Built target 
cutlass_library_rank_2k_sm80_c1688tf32syr2k [ 97%] Built target cutlass_library_rank_2k_sm80_d884syr2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_s1688syr2k.so [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_gz884syr2k.so [ 97%] Built target cutlass_library_rank_2k_sm80_gz884her2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_s1688tf32syr2k.so [ 97%] Built target cutlass_library_rank_2k_sm80_gz884syr2k [ 97%] Built target cutlass_library_rank_2k_sm80_s1688syr2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_z884her2k.so [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_z884syr2k.so [ 97%] Built target cutlass_library_rank_2k_sm80_s1688tf32syr2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm90_d1684syr2k.so [ 97%] Built target cutlass_library_rank_2k_sm80_z884her2k [ 97%] Built target cutlass_library_rank_2k_sm80_z884syr2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm90_gz1684her2k.so [ 97%] Built target cutlass_library_rank_2k_sm90_d1684syr2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm90_gz1684syr2k.so [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm90_z1684her2k.so [ 97%] Built target cutlass_library_rank_2k_sm90_gz1684her2k [ 97%] Built target cutlass_library_rank_2k_sm90_gz1684syr2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm90_z1684syr2k.so [ 97%] Linking CUDA shared library libcutlass_trmm_sm80_c1688tf32trmm.so [ 97%] Built target cutlass_library_rank_2k_sm90_z1684her2k [ 97%] Linking CUDA shared library libcutlass_trmm_sm80_c1688trmm.so [ 97%] Built target cutlass_library_rank_2k_sm90_z1684syr2k [ 97%] Linking CUDA shared library libcutlass_trmm_sm80_d884trmm.so [ 97%] Built target cutlass_library_trmm_sm80_c1688tf32trmm [ 97%] Linking CUDA shared library libcutlass_trmm_sm80_gz884trmm.so [ 97%] Built target cutlass_library_trmm_sm80_c1688trmm [ 97%] Built target cutlass_library_trmm_sm80_d884trmm [ 97%] Linking CUDA shared library 
libcutlass_trmm_sm80_s1688tf32trmm.so [ 97%] Linking CUDA shared library libcutlass_trmm_sm80_s1688trmm.so [ 97%] Built target cutlass_library_trmm_sm80_gz884trmm [ 98%] Linking CUDA shared library libcutlass_trmm_sm80_z884trmm.so [ 98%] Built target cutlass_library_trmm_sm80_s1688tf32trmm [ 98%] Built target cutlass_library_trmm_sm80_s1688trmm [ 98%] Linking CUDA shared library libcutlass_trmm_sm90_d1684trmm.so [ 98%] Linking CUDA shared library libcutlass_trmm_sm90_gz1684trmm.so [ 98%] Built target cutlass_library_trmm_sm80_z884trmm [ 98%] Linking CUDA shared library libcutlass_trmm_sm90_z1684trmm.so [ 98%] Built target cutlass_library_trmm_sm90_d1684trmm [ 98%] Linking CUDA shared library libcutlass_symm_sm80_c1688hemm.so [ 98%] Built target cutlass_library_trmm_sm90_gz1684trmm [ 98%] Linking CUDA shared library libcutlass_symm_sm80_c1688symm.so [ 98%] Built target cutlass_library_trmm_sm90_z1684trmm [ 98%] Built target cutlass_library_symm_sm80_c1688hemm [ 98%] Linking CUDA shared library libcutlass_symm_sm80_c1688tf32hemm.so [ 98%] Built target cutlass_library_symm_sm80_c1688symm [ 99%] Linking CUDA shared library libcutlass_symm_sm80_c1688tf32symm.so [ 99%] Linking CUDA shared library libcutlass_symm_sm80_d884symm.so [ 99%] Built target cutlass_library_symm_sm80_c1688tf32hemm [ 99%] Built target cutlass_library_symm_sm80_c1688tf32symm [ 99%] Linking CUDA shared library libcutlass_symm_sm80_gz884hemm.so [ 99%] Built target cutlass_library_symm_sm80_d884symm [ 99%] Linking CUDA shared library libcutlass_symm_sm80_gz884symm.so [ 99%] Linking CUDA shared library libcutlass_symm_sm80_s1688symm.so [ 99%] Built target cutlass_library_symm_sm80_gz884hemm [ 99%] Linking CUDA shared library libcutlass_symm_sm80_s1688tf32symm.so [ 99%] Built target cutlass_library_symm_sm80_gz884symm [ 99%] Linking CUDA shared library libcutlass_symm_sm80_z884hemm.so [ 99%] Built target cutlass_library_symm_sm80_s1688symm [ 99%] Linking CUDA shared library 
libcutlass_symm_sm80_z884symm.so [ 99%] Built target cutlass_library_symm_sm80_s1688tf32symm [ 99%] Linking CUDA shared library libcutlass_symm_sm90_d1684symm.so [ 99%] Built target cutlass_library_symm_sm80_z884hemm [ 99%] Built target cutlass_library_symm_sm80_z884symm [ 99%] Linking CUDA shared library libcutlass_symm_sm90_gz1684hemm.so [ 99%] Linking CUDA shared library libcutlass_symm_sm90_gz1684symm.so [ 99%] Built target cutlass_library_symm_sm90_d1684symm [ 99%] Built target cutlass_library_symm_sm90_gz1684hemm [ 99%] Built target cutlass_library_symm_sm90_gz1684symm [ 99%] Built target cutlass_library_symm_sm90_z1684hemm_objs [ 99%] Linking CUDA shared library libcutlass_symm_sm90_z1684hemm.so [ 99%] Linking CUDA static library libcutlass_symm_sm90_z1684hemm.a [ 99%] Built target cutlass_library_symm_sm90_z1684hemm_static [ 99%] Linking CXX static library libcutlass.a [ 99%] Built target cutlass_library_symm_sm90_z1684hemm [ 99%] Linking CXX shared library libcutlass.so [ 99%] Built target cutlass_library_static [ 99%] Built target cutlass_library [ 99%] Building CXX object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/main.cpp.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/options.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/cutlass_profiler.cu.o [ 99%] Building CXX object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/performance_report.cpp.o [ 99%] Building CXX object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/enumerated_types.cpp.o [ 99%] Building CXX object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/gpu_timer.cpp.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/device_allocation.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/device_context.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/cublas_helpers.cu.o [ 99%] Building CXX object 
tools/profiler/CMakeFiles/cutlass_profiler.dir/src/cudnn_helpers.cpp.o [ 99%] Building CXX object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/problem_space.cpp.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/operation_profiler.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/gemm_operation_profiler.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/rank_k_operation_profiler.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/rank_2k_operation_profiler.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/trmm_operation_profiler.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/symm_operation_profiler.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/conv2d_operation_profiler.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/conv3d_operation_profiler.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/sparse_gemm_operation_profiler.cu.o [100%] Linking CXX executable cutlass_profiler [100%] Built target cutlass_profiler + popd ~/build/BUILD/cutlass + exit 0 Executing(%install): /bin/sh -e /var/tmp/rpm-tmp.iCobRD + umask 022 + cd /builddir/build/BUILD + '[' /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64 '!=' / ']' + rm -rf /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64 ++ dirname /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64 + mkdir -p /builddir/build/BUILDROOT + mkdir /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64 + cd cutlass + rm -rf /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64 + pushd build ~/build/BUILD/cutlass/build ~/build/BUILD/cutlass + DESTDIR=/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64 + /usr/bin/cmake --install . 
-- Install configuration: "Release" -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/functional.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/functional.h.fp16~ -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/workspace.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/wmma_array.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/version.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/uint128.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/warp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/warp/vector_fragment_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/threadblock -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/threadblock/vector_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/threadblock/regular_tile_iterator_tensor_op_sm70.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/threadblock/regular_tile_iterator_tensor_op.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/threadblock/regular_tile_iterator_pitch_linear_2dthreadtile.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/threadblock/regular_tile_iterator_pitch_linear.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/threadblock/regular_tile_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/threadblock/regular_tile_access_iterator_tensor_op_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/threadblock/regular_tile_access_iterator_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/threadblock/regular_tile_access_iterator_pitch_linear_direct_conv.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/threadblock/regular_tile_access_iterator_pitch_linear.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/threadblock/regular_tile_access_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/threadblock/regular_scale_bias_vector_access_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/threadblock/predicated_vector_access_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/threadblock/predicated_tile_iterator_triangular_matrix.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/threadblock/predicated_tile_iterator_2dthreadtile.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/threadblock/predicated_tile_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/threadblock/predicated_tile_access_iterator_triangular_matrix.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/threadblock/predicated_tile_access_iterator_params.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/threadblock/predicated_tile_access_iterator_2dthreadtile.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/threadblock/predicated_tile_access_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/threadblock/predicated_scale_bias_vector_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/threadblock/predicated_scale_bias_vector_access_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/threadblock/ell_predicated_tile_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/threadblock/ell_predicated_tile_access_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/threadblock/ell_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/thread -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/thread/unary_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/thread/transpose.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/pitch_linear_thread_map.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/collective -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/transform/collective/sm90_wgmma_transpose.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/trace.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/thread -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/thread/matrix.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/tfloat32.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/tensor_view_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/tensor_view.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/tensor_ref_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/tensor_ref.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/tensor_coord.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/subbyte_reference.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/semaphore.h 
-- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/relatively_equal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/reduction -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/reduction/threadblock_swizzle.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/reduction/thread -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/reduction/thread/reduction_operators.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/reduction/thread/reduce.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/reduction/kernel -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/reduction/kernel/tensor_reduce_affine_strided.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/reduction/kernel/tensor_reduce_affine_contiguous.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/reduction/kernel/reduce_split_k.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/reduction/kernel/reduce_softmax_final.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/reduction/device -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/reduction/device/tensor_reduce_affine_strided.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/reduction/device/tensor_reduce_affine_contiguous.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/reduction/device/tensor_reduce.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/reduction/device/reduce_split_k.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/real.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/quaternion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/predicate_vector.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/platform -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/platform/platform.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/pitch_linear_coord.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/pipeline -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/pipeline/sm90_pipeline.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/pipeline/pipeline.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/numeric_types.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/numeric_size.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/numeric_conversion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/matrix_shape.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/matrix_coord.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/matrix.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/layout -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/layout/vector.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/layout/tensor_op_multiplicand_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/layout/tensor_op_multiplicand_sm75.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/layout/tensor_op_multiplicand_sm70.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/layout/tensor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/layout/pitch_linear.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/layout/permute.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/layout/matrix.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/layout/layout.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/kernel_launch.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/kernel_hardware_info.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/kernel_hardware_info.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/integer_subbyte.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/half.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm_coord.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm_coord.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/tile_iterator_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/softmax_scale_bias_transform.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/scale_bias_tile_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/mma_with_reduction_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/mma_tensor_op_wmma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/mma_tensor_op_tile_iterator_wmma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/mma_tensor_op_tile_iterator_sparse.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/mma_tensor_op_tile_iterator_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/mma_tensor_op_tile_iterator_sm70.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/mma_tensor_op_tile_iterator.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/mma_tensor_op_tile_access_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/mma_tensor_op_sm70.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/mma_tensor_op_policy.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/mma_tensor_op_fragment_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/mma_tensor_op_fast_f32.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/mma_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/mma_sparse_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/mma_simt_tile_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/mma_simt_policy.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/mma_simt.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/mma_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/mma_mixed_input_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/mma_gaussian_complex_tensor_op_tile_iterator_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/mma_gaussian_complex_tensor_op.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/mma_complex_tensor_op_tile_iterator_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/mma_complex_tensor_op_fast_f32.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/mma_complex_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/mma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/layernorm_scale_bias_transform.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/default_mma_wmma_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/default_mma_with_reduction_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/default_mma_tensor_op_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/default_mma_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/default_mma_sparse_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/warp/default_mma_complex_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/threadblock_swizzle_streamk.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/threadblock_swizzle.h -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/mma_with_reduction_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/mma_sparse_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/mma_sparse_base.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/mma_softmax_mainloop_fusion_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/mma_singlestage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/mma_planar_complex_pipelined.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/mma_planar_complex_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/mma_planar_complex_base.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/mma_pipelined.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/mma_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/mma_layernorm_mainloop_fusion_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/mma_blas3_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/mma_base.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/index_remat.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/gemv.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/ell_mma_pipelined.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/ell_mma_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/default_trmm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/default_sparse_mma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/default_multistage_trmm_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/default_multistage_mma_complex_core_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/default_multistage_mma_complex_core.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/default_multistage_mma_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/default_mma_with_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/default_mma_softmax_mainloop_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/default_mma_planar_complex_pipelined.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/default_mma_planar_complex_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/default_mma_layernorm_mainloop_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/default_mma_core_wmma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/default_mma_core_with_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/default_mma_core_with_access_size.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/default_mma_core_sparse_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/default_mma_core_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/default_mma_core_sm75.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/default_mma_core_sm70.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/default_mma_core_simt.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/default_mma_core.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/default_mma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/default_gemv_core.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/threadblock/default_ell_mma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/thread -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/thread/mma_sm61.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/thread/mma_sm60.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/thread/mma_sm50.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/thread/mma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/trmm_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/tile_scheduler_params.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/tile_scheduler.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/symm_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/static_tile_scheduler.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/sparse_gemm_with_visitor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/sparse_gemm_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/sparse_gemm.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/sm90_tile_scheduler_stream_k.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/sm90_tile_scheduler_group.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/sm90_tile_scheduler.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/sm90_gemm_warpspecialized_pingpong.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/sm90_gemm_warpspecialized_cooperative.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/sm90_gemm_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/sm90_gemm_tma.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/sm90_gemm_array_tma_warpspecialized_cooperative.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/sm70_gemm.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/rank_k_universal.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/rank_2k_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/rank_2k_transpose_operands.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/rank_2k_grouped_problem_visitor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/rank_2k_grouped.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/params_universal_base.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/params_sparse_base.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/grouped_problem_visitor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/gemv_batched_strided.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/gemv.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/gemm_with_k_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/gemm_with_fused_epilogue.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/gemm_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/gemm_universal_with_visitor_streamk.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/gemm_universal_with_visitor.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/gemm_universal_streamk.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/gemm_universal.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/gemm_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/gemm_transpose_operands.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/gemm_streamk_with_fused_epilogue.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/gemm_splitk_parallel.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/gemm_planar_complex_array.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/gemm_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/gemm_pipelined.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/gemm_params.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/gemm_layernorm_mainloop_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/gemm_grouped_softmax_mainloop_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/gemm_grouped_problem_visitor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/gemm_grouped.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/gemm_batched.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/gemm_array.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/ell_gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_trmm_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_trmm_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_trmm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_symm_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_symm_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_symm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_rank_k_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_rank_k_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_rank_k.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_rank_2k_universal.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_rank_2k_grouped.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_rank_2k_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_rank_2k.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_gemv.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_gemm_with_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_gemm_with_k_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_gemm_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_gemm_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_gemm_universal_with_visitor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_gemm_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_gemm_streamk_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_gemm_splitk_parallel.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_gemm_sparse_with_visitor.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_gemm_sparse_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_gemm_sparse.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_gemm_planar_complex_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_gemm_layernorm_mainloop_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_gemm_grouped_softmax_mainloop_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_gemm_grouped.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_gemm_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/kernel/default_ell_gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/group_array_problem_shape.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/gemm_enumerated_types.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/dispatch_policy.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device/trmm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device/symm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device/rank_k.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device/rank_2k_grouped.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device/rank_2k.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device/gemv.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device/gemm_with_k_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device/gemm_universal_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device/gemm_universal_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device/gemm_universal_streamk_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device/gemm_universal_base.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device/gemm_universal_adapter.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device/gemm_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device/gemm_splitk_parallel.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device/gemm_sparse_with_visitor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device/gemm_sparse_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device/gemm_sparse.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device/gemm_layernorm_mainloop_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device/gemm_grouped.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device/gemm_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device/gemm_batched.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device/gemm_array.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device/gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device/ell_gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device/default_gemm_configuration.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/device/base_grouped.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/collective -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized_fp8.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/collective/sm90_mma_tma_gmma_rs_warpspecialized_mixed_input.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/collective/sm90_mma_tma_gmma_rs_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/collective/sm90_mma_multistage_gmma_ss_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/collective/sm90_mma_multistage_gmma_rs_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/collective/sm90_mma_array_tma_gmma_ss_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/collective/sm80_mma_multistage.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/collective/sm70_mma_twostage.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/collective/fp8_accumulation.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/collective/collective_mma.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/collective/collective_builder.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/collective/builders -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/collective/builders/sm90_gmma_builder.inl -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/gemm/collective/builders/sm90_common.inl -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/floating_point_nvrtc.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/float8.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/fast_math.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/warp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/warp/wmma_tensor_op_policy.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/warp/volta_tensor_op_policy.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/warp/tile_iterator_wmma_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/warp/tile_iterator_volta_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/warp/tile_iterator_tensor_op_mixed.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/warp/tile_iterator_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/warp/tile_iterator_simt.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/warp/tensor_op_policy.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/warp/simt_policy.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/warp/fragment_iterator_wmma_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/warp/fragment_iterator_volta_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/warp/fragment_iterator_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/warp/fragment_iterator_simt.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/warp/fragment_iterator_gaussian_complex_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/warp/fragment_iterator_complex_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/shared_load_iterator_pitch_linear.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/shared_load_iterator_mixed.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/shared_load_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator_strided_dgrad.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator_predicates.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator_params.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator_direct_conv.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator_conv.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator_blas3.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator_affine_layout_params.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator_affine.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/output_tile_thread_map.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/output_iterator_parameter.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/interleaved_epilogue.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/fusion -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/fusion/visitors.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/fusion/visitor_store.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/fusion/visitor_load.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/fusion/visitor_compute.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/fusion/visitor_2x.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/epilogue_workspace.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/epilogue_with_visitor_callbacks.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/epilogue_with_visitor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/epilogue_with_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/epilogue_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/epilogue_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/epilogue_visitor_with_softmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/epilogue_streamk_with_broadcast.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/epilogue_smem_accumulator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/epilogue_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/epilogue_gemm_k_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/epilogue_direct_store.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/epilogue_depthwise.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/epilogue_base_streamk.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/epilogue_base.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/epilogue.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/direct_store_epilogue_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/default_thread_map_wmma_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/default_thread_map_volta_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/default_thread_map_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/default_thread_map_simt.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/default_epilogue_wmma_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/default_epilogue_with_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/default_epilogue_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/default_epilogue_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/default_epilogue_volta_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/default_epilogue_tensor_op_blas3.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/default_epilogue_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/default_epilogue_simt.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/default_epilogue_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/default_epilogue_direct_store.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/default_epilogue_complex_tensor_op_blas3.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/threadblock/default_epilogue_complex_tensor_op.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/thread -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/thread/scale_type.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/thread/reduction_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/thread/linear_combination_with_elementwise.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/thread/linear_combination_tensor_broadcast.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/thread/linear_combination_silu.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/thread/linear_combination_sigmoid.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/thread/linear_combination_residual_block.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/thread/linear_combination_relu0.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/thread/linear_combination_relu.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/thread/linear_combination_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/thread/linear_combination_params.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/thread/linear_combination_leaky_relu.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/thread/linear_combination_hardswish.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/thread/linear_combination_generic_with_scaling.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/thread/linear_combination_generic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/thread/linear_combination_gelu.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/thread/linear_combination_drelu.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/thread/linear_combination_dgelu.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/thread/linear_combination_clamp.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/thread/linear_combination_bias_relu.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/thread/linear_combination_bias_elementwise.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/thread/linear_combination.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/thread/detail.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/thread/conversion_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/thread/activation.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/fusion -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/fusion/sm90_visitor_tma_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/fusion/sm90_visitor_store_tma_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/fusion/sm90_visitor_load_tma_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/fusion/sm90_visitor_compute_tma_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/fusion/sm90_callbacks_tma_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/fusion/operations.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/fusion/callbacks.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/dispatch_policy.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/collective -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized_bias_elementwise.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/collective/sm70_epilogue_vectorized.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/collective/epilogue_tensor_broadcast.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/collective/detail.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/collective/default_epilogue_array.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/collective/default_epilogue.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/collective/collective_epilogue.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/collective/collective_builder.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/collective/builders -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/epilogue/collective/builders/sm90_builder.inl -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/device_kernel.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/detail -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/detail/mma.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/detail/layout.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/detail/helper_macros.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/detail/dependent_false.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/detail/collective.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/cutlass.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/cuda_host_adapter.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/core_io.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/coord.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/warp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/warp/scale_bias_relu_transform.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/warp/mma_depthwise_simt_tile_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/warp/mma_depthwise_simt.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/threadblock_swizzle.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/predicated_scale_bias_vector_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/predicated_scale_bias_vector_access_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/implicit_gemm_wgrad_fusion_multistage.h -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/implicit_gemm_pipelined.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/implicit_gemm_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/implicit_gemm_fprop_fusion_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/depthwise_mma_core_with_lane_access_size.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/depthwise_mma_base.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/depthwise_fprop_pipelined.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/depthwise_fprop_filter_tile_access_iterator_direct_conv_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/depthwise_fprop_direct_conv_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/depthwise_fprop_activation_tile_access_iterator_direct_conv_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/depthwise_fprop_activation_tile_access_iterator_direct_conv_fixed_stride_dilation.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/depthwise_direct_conv_params.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv3d_wgrad_output_gradient_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv3d_wgrad_output_gradient_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv3d_wgrad_activation_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv3d_wgrad_activation_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv3d_params.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv3d_fprop_filter_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv3d_fprop_filter_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv3d_fprop_activation_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv3d_fprop_activation_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv3d_dgrad_output_gradient_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv3d_dgrad_output_gradient_tile_access_iterator_analytic.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv3d_dgrad_filter_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv3d_dgrad_filter_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv2d_wgrad_output_gradient_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv2d_wgrad_output_gradient_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv2d_wgrad_activation_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv2d_wgrad_activation_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv2d_tile_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv2d_params.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv2d_fprop_filter_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv2d_fprop_filter_tile_access_iterator_fixed_channels.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv2d_fprop_filter_tile_access_iterator_few_channels.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv2d_fprop_filter_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_fixed_channels.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_few_channels.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv2d_dgrad_output_gradient_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv2d_dgrad_output_gradient_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv2d_dgrad_filter_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/threadblock/conv2d_dgrad_filter_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/thread -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/thread/depthwise_mma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/implicit_gemm_convolution_with_fused_epilogue.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/implicit_gemm_convolution_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/implicit_gemm_convolution_strided_dgrad.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/implicit_gemm_convolution_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/implicit_gemm_convolution.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/direct_convolution.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/default_depthwise_fprop.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/default_deconv3d_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/default_deconv3d.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/default_deconv2d_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/default_deconv2d.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/default_conv3d_wgrad.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/default_conv3d_fprop_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/default_conv3d_fprop_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/default_conv3d_fprop.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/default_conv3d_dgrad.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/default_conv2d_wgrad_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/default_conv2d_wgrad.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/default_conv2d_group_fprop.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/default_conv2d_fprop_with_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/default_conv2d_fprop_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/default_conv2d_fprop_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/default_conv2d_fprop_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/default_conv2d_fprop.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/default_conv2d_dgrad.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/default_conv2d.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/kernel/conv_universal.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/dispatch_policy.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/device -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/device/implicit_gemm_convolution_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/device/implicit_gemm_convolution.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/device/direct_convolution.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/device/conv_universal_adapter.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/convolution.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/convnd_problem_shape.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/conv3d_problem_size.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/conv2d_problem_size.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/collective -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/collective/sm90_implicit_gemm_gmma_ss_warpspecialized.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/collective/detail.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/collective/collective_conv.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/collective/collective_builder.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/collective/builders -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/collective/builders/sm90_gmma_builder.inl -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/conv/collective/builders/sm90_common.inl -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/constants.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/cluster_launch.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/block_striped.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/blas3_types.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/blas3.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/bfloat16.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/barrier.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/array_subbyte.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/array_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/array.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/arch -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/arch/wmma_sm75.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/arch/wmma_sm72.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/arch/wmma_sm70.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/arch/wmma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/arch/simd_sm61.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/arch/simd_sm60.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/arch/simd.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/arch/reg_reconfig.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/arch/mma_sparse_sm89.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/arch/mma_sparse_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/arch/mma_sm90.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/arch/mma_sm89.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/arch/mma_sm80.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/arch/mma_sm75.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/arch/mma_sm70.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/arch/mma_sm61.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/arch/mma_sm60.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/arch/mma_sm50.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/arch/mma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/arch/memory_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/arch/memory_sm75.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/arch/memory.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/arch/cache_operation.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/arch/barrier.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/arch/arch.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/aligned_buffer.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/util -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/util/type_traits.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/util/print.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/util/debug.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/underscore.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/tensor_predicate.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/tensor.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/swizzle_layout.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/swizzle.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/stride.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/pointer_swizzle.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/pointer_flagged.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/pointer_base.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/pointer.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/numeric -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/numeric/real.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/numeric/numeric_types.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/numeric/math.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/numeric/integral_ratio.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/numeric/integral_constant.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/numeric/integer_sequence.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/numeric/int.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/numeric/complex.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/numeric/arithmetic_tuple.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/layout_composed.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/layout.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/int_tuple.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/container -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/container/type_list.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/container/tuple.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/container/cuda_types.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/container/bit_field.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/container/array_subbyte.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/container/array_aligned.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/container/array.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/container/alignment.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/config.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/atom -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/atom/mma_traits_sm90_gmma.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/atom/mma_traits_sm90.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/atom/mma_traits_sm80.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/atom/mma_traits_sm75.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/atom/mma_traits_sm70.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/atom/mma_traits_sm61.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/atom/mma_traits.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/atom/mma_atom.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/atom/copy_traits_sm90_tma_swizzle.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/atom/copy_traits_sm90_tma.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/atom/copy_traits_sm90_im2col.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/atom/copy_traits_sm90.hpp -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/atom/copy_traits_sm80.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/atom/copy_traits_sm75.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/atom/copy_traits_sm50.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/atom/copy_traits.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/atom/copy_atom.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/arch -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/arch/util.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/arch/mma_sm90_gmma.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/arch/mma_sm90_desc.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/arch/mma_sm90.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/arch/mma_sm80.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/arch/mma_sm75.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/arch/mma_sm70.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/arch/mma_sm61.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/arch/mma.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/arch/copy_sm90_tma.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/arch/copy_sm90_desc.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/arch/copy_sm90.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/arch/copy_sm80.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/arch/copy_sm75.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/arch/copy_sm50.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/arch/copy.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/arch/cluster_sm90.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/algorithm -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/algorithm/tuple_algorithms.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/algorithm/tensor_algorithms.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/algorithm/prefetch.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/algorithm/prefer.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/algorithm/gemm.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/algorithm/functional.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/algorithm/fill.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/algorithm/copy.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/algorithm/cooperative_gemm.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/algorithm/cooperative_copy.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/algorithm/clear.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cute/algorithm/axpby.hpp -- Up-to-date: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include -- Up-to-date: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/cutlass/version_extended.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/test/cutlass -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/test/cutlass/bin -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/test/cutlass/lib64 -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/test/cutlass/ctest -- Up-to-date: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/ -- Up-to-date: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/type_traits.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/tensor_view_io.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/host -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/host/trmm_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/host/trmm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/host/tensor_reduce.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/host/tensor_reduce.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/host/tensor_norm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/host/tensor_foreach.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/host/tensor_fill.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/host/tensor_fill.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/host/tensor_elementwise.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/host/tensor_copy.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/host/tensor_compare.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/host/tensor_compare.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/host/symm_complex.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/host/symm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/host/rank_k_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/host/rank_2k_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/host/rank_2k.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/host/gett.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/host/gemm_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/host/gemm_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/host/gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/host/error_metrics.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/host/convolution.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/host/conv.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/device -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/device/thread -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/device/thread/gemm.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/device/tensor_relu.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/device/tensor_reduce.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/device/tensor_foreach.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/device/tensor_fill.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/device/tensor_compare.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/device/rank_2k_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/device/kernel -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/device/kernel/tensor_foreach.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/device/kernel/tensor_elementwise.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/device/kernel/gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/device/gett.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/device/gemm_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/device/gemm_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/device/gemm.h 
-- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/device/convolution.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/detail -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/detail/linear_to_coordinate.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/reference/detail/inner_product.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/print_error.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/packed_stride.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/index_sequence.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/host_uncompress.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/host_tensor_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/host_tensor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/host_reorder.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/helper_cuda.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/gett_commandline.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/exceptions.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/distribution.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/device_utils.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/device_rmsnorm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/device_nhwc_to_nchw.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/device_nhwc_pooling.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/device_nhwc_padding.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/device_nchw_to_nhwc.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/device_memory.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/device_layernorm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/device_groupnorm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/device_dump.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/debug.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/cublas_wrappers.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/command_line.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/util/GPU_Clock.hpp -- Up-to-date: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include/ -- Up-to-date: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/library -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/library/util.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/library/types.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/library/singleton.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/library/operation_table.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/library/manifest.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/library/library.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/library/handle.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/library/descriptions.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/include//cutlass/library/arch_mappings.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm50_cgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm50_cgemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm50_dgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm50_dgemm.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm50_sgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm50_sgemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm60_hgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm60_hgemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm61_igemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm61_igemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm61_s8_igemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm61_s8_igemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_planar_complex_array_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_planar_complex_array_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_planar_complex_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_planar_complex_f16.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_h884gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_h884gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_h884gemm_planar_complex.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_h884gemm_planar_complex.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_h884gemm_planar_complex_array.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_h884gemm_planar_complex_array.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_s884gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_s884gemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_s884gemm_planar_complex_array_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_s884gemm_planar_complex_array_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_s884gemm_planar_complex_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_s884gemm_planar_complex_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_f16.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_array_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_array_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_h1688gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_h1688gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_h1688gemm_planar_complex.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_h1688gemm_planar_complex.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_h1688gemm_planar_complex_array.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_h1688gemm_planar_complex_array.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_i88128xorgemm_b1.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_i88128xorgemm_b1.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_i8816gemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_i8816gemm_s8.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_i8816gemm_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_i8816gemm_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_i8832gemm_s4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_i8832gemm_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_i8832gemm_u4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_i8832gemm_u4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_s1688gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_s1688gemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_s1688gemm_planar_complex_array_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_s1688gemm_planar_complex_array_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_s1688gemm_planar_complex_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_s1688gemm_planar_complex_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_s4_i8832gemm_s4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_s4_i8832gemm_s4.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_s8_i8816gemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_s8_i8816gemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_u4_i8832gemm_u4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_u4_i8832gemm_u4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_u8_i8816gemm_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_u8_i8816gemm_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_s8_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_s8_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_u8_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_u8_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_bf16_s16832spgemm_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_bf16_s16832spgemm_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_c1688gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_c1688gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_c1688tf32gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_c1688tf32gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_cgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_cgemm.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_d884gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_d884gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_dgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_dgemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_array_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_array_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_f16.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_s8_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_s8_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_u8_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_u8_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_f16_s16832spgemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_f16_s16832spgemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_gz884gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_gz884gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_h16816gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_h16816gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_h16816gemm_grouped.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_h16816gemm_grouped.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_h16816gemm_planar_complex.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_h16816gemm_planar_complex.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_h16816gemm_planar_complex_array.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_h16816gemm_planar_complex_array.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_h16816gemm_s8_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_h16816gemm_s8_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_h16832spgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_h16832spgemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_i168128spgemm_s4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_i168128spgemm_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_i168256andgemm_b1.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_i168256andgemm_b1.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_i168256xorgemm_b1.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_i168256xorgemm_b1.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_i16832gemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_i16832gemm_s8.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_i16832gemm_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_i16832gemm_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_i16864gemm_s4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_i16864gemm_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_i16864gemm_u4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_i16864gemm_u4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_i16864spgemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_i16864spgemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16_u8.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_grouped_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_grouped_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_grouped_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_grouped_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_array_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_array_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_array_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_array_f16.a -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_s8_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_s8_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_s8_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_s8_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_u8_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_u8_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_u8_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_u8_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816tf32spgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816tf32spgemm.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16832spgemm_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16832spgemm_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16832spgemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16832spgemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s1688bf16gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s1688bf16gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s1688f16gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s1688f16gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s1688gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s1688gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s1688gemm_tf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s1688gemm_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s1688tf32gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s1688tf32gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s4_i168128spgemm_s4.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s4_i168128spgemm_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s4_i16864gemm_s4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s4_i16864gemm_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s8_i16832gemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s8_i16832gemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s8_i16864spgemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s8_i16864spgemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_sgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_sgemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_tf32_s1688gemm_tf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_tf32_s1688gemm_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_u4_i16864gemm_u4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_u4_i16864gemm_u4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_u8_i16832gemm_u8.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_u8_i16832gemm_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_z884gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_z884gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e4m3_e5m2.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e4m3_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e5m2_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e5m2_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e4m3.so -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e4m3_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e4m3_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e5m2_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e5m2_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x16gemm_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x16gemm_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2.so -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_d1684gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_d1684gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x16gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x16gemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3.so -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_gz1684gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_gz1684gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_h64x128x16gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_h64x128x16gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_i64x128x32gemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_i64x128x32gemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_i64x128x32gemm_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_i64x128x32gemm_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s64x128x16gemm_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s64x128x16gemm_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s64x128x16gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s64x128x16gemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e4m3.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e4m3_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e4m3_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e5m2_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e5m2_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s64x128x8gemm_tf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s64x128x8gemm_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s64x128x8tf32gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s64x128x8tf32gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s8_i64x128x32gemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s8_i64x128x32gemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s8_i64x128x32gemm_u8.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s8_i64x128x32gemm_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_void_i64x128x32gemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_void_i64x128x32gemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_void_i64x128x32gemm_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_void_i64x128x32gemm_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x16gemm_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x16gemm_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x16gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x16gemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2.so -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_z1684gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_z1684gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm50_cf32_cdgrad_optimized_cf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm50_cf32_cdgrad_optimized_cf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm50_cf32_cfprop_optimized_cf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm50_cf32_cfprop_optimized_cf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm50_cf32_cwgrad_optimized_cf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm50_cf32_cwgrad_optimized_cf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm50_sdgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm50_sdgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm50_sfprop_optimized.so -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm50_sfprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm50_swgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm50_swgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm60_hfprop_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm60_hfprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_f16_s884dgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_f16_s884dgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_f16_s884fprop_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_f16_s884fprop_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_f16_s884wgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_f16_s884wgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_h884dgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_h884dgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_h884fprop_optimized.so -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_h884fprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_h884wgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_h884wgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_s884dgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_s884dgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_s884fprop_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_s884fprop_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_s884wgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_s884wgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_cf32_cdgrad_optimized_cf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_cf32_cdgrad_optimized_cf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_cf32_cfprop_optimized_cf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_cf32_cfprop_optimized_cf32.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_cf32_cwgrad_optimized_cf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_cf32_cwgrad_optimized_cf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688dgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688dgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_few_channels_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_few_channels_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_fixed_channels_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_fixed_channels_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688wgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688wgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_h1688dgrad_optimized.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_h1688dgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_few_channels.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_few_channels.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_fixed_channels.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_fixed_channels.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_h1688wgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_h1688wgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_i8816fprop_optimized_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_i8816fprop_optimized_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_i8816fprop_optimized_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_i8816fprop_optimized_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_i8832fprop_optimized_s4.so -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_i8832fprop_optimized_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_i8832fprop_optimized_u4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_i8832fprop_optimized_u4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s1688dgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s1688dgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_few_channels_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_few_channels_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_fixed_channels_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_fixed_channels_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s1688wgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s1688wgrad_optimized_f16.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s4_i8832fprop_optimized_s4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s4_i8832fprop_optimized_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_few_channels_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_few_channels_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_fixed_channels_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_fixed_channels_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_optimized_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_optimized_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_u4_i8832fprop_optimized_u4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_u4_i8832fprop_optimized_u4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_few_channels_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_few_channels_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_fixed_channels_u8.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_fixed_channels_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_optimized_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_optimized_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816dgrad_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816dgrad_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816fprop_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816fprop_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816wgrad_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816wgrad_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_f16_s16816dgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_f16_s16816dgrad_optimized_f16.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_f16_s16816fprop_fixed_channels_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_f16_s16816fprop_fixed_channels_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_f16_s16816fprop_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_f16_s16816fprop_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_f16_s16816wgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_f16_s16816wgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_h16816dgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_h16816dgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_h16816fprop_fixed_channels.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_h16816fprop_fixed_channels.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_h16816fprop_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_h16816fprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_h16816wgrad_optimized.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_h16816wgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_i16832fprop_optimized_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_i16832fprop_optimized_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_i16832fprop_optimized_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_i16832fprop_optimized_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_i16864fprop_optimized_s4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_i16864fprop_optimized_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_i16864fprop_optimized_u4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_i16864fprop_optimized_u4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s16816dgrad_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s16816dgrad_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s16816dgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s16816dgrad_optimized_f16.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_fixed_channels_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_fixed_channels_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_fixed_channels_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_fixed_channels_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s16816wgrad_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s16816wgrad_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s16816wgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s16816wgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688bf16dgrad_optimized.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688bf16dgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688bf16fprop_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688bf16fprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688bf16wgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688bf16wgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688dgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688dgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688dgrad_optimized_tf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688dgrad_optimized_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688f16dgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688f16dgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688f16fprop_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688f16fprop_optimized.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688f16wgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688f16wgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688fprop_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688fprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688fprop_optimized_tf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688fprop_optimized_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688tf32dgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688tf32dgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688tf32fprop_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688tf32fprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688tf32wgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688tf32wgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688wgrad_optimized.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688wgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688wgrad_optimized_tf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688wgrad_optimized_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s4_i16864fprop_optimized_s4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s4_i16864fprop_optimized_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_few_channels_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_few_channels_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_fixed_channels_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_fixed_channels_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_optimized_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_optimized_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_sdgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_sdgrad_optimized.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_sfprop_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_sfprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_swgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_swgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688dgrad_optimized_tf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688dgrad_optimized_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688fprop_optimized_tf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688fprop_optimized_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688wgrad_optimized_tf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688wgrad_optimized_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_u4_i16864fprop_optimized_u4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_u4_i16864fprop_optimized_u4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_few_channels_u8.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_few_channels_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_fixed_channels_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_fixed_channels_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_optimized_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_optimized_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_optimized_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_optimized_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_optimized_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_optimized_e5m2.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_f16_s16816dgrad3d_analytic_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_f16_s16816dgrad3d_analytic_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_f16_s16816dgrad3d_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_f16_s16816dgrad3d_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_f16_s16816fprop3d_optimized_f16.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_f16_s16816fprop3d_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_f16_s16816wgrad3d_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_f16_s16816wgrad3d_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_h16816dgrad3d_analytic.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_h16816dgrad3d_analytic.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_h16816dgrad3d_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_h16816dgrad3d_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_h16816fprop3d_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_h16816fprop3d_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_h16816wgrad3d_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_h16816wgrad3d_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_analytic_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_analytic_bf16.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_analytic_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_analytic_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_s16816fprop3d_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_s16816fprop3d_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_s16816fprop3d_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_s16816fprop3d_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_s16816wgrad3d_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_s16816wgrad3d_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_s16816wgrad3d_optimized_f16.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_s16816wgrad3d_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_c1688herk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_c1688herk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_c1688syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_c1688syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_c1688tf32herk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_c1688tf32herk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_c1688tf32syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_c1688tf32syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_d884syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_d884syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_gz884herk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_gz884herk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_gz884syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_gz884syrk.a -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_s1688syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_s1688syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_s1688tf32syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_s1688tf32syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_z884herk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_z884herk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_z884syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_z884syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm90_d1684syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm90_d1684syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm90_gz1684herk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm90_gz1684herk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm90_gz1684syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm90_gz1684syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm90_z1684herk.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm90_z1684herk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm90_z1684syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm90_z1684syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_c1688her2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_c1688her2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_c1688syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_c1688syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_c1688tf32her2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_c1688tf32her2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_c1688tf32syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_c1688tf32syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_d884syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_d884syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_gz884her2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_gz884her2k.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_gz884syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_gz884syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_s1688syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_s1688syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_s1688tf32syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_s1688tf32syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_z884her2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_z884her2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_z884syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_z884syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm90_d1684syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm90_d1684syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm90_gz1684her2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm90_gz1684her2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm90_gz1684syr2k.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm90_gz1684syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm90_z1684her2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm90_z1684her2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm90_z1684syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm90_z1684syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm80_c1688tf32trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm80_c1688tf32trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm80_c1688trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm80_c1688trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm80_d884trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm80_d884trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm80_gz884trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm80_gz884trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm80_s1688tf32trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm80_s1688tf32trmm.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm80_s1688trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm80_s1688trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm80_z884trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm80_z884trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm90_d1684trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm90_d1684trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm90_gz1684trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm90_gz1684trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm90_z1684trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm90_z1684trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_c1688hemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_c1688hemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_c1688symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_c1688symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_c1688tf32hemm.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_c1688tf32hemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_c1688tf32symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_c1688tf32symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_d884symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_d884symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_gz884hemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_gz884hemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_gz884symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_gz884symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_s1688symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_s1688symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_s1688tf32symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_s1688tf32symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_z884hemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_z884hemm.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_z884symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_z884symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm90_d1684symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm90_d1684symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm90_gz1684hemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm90_gz1684hemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm90_gz1684symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm90_gz1684symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm90_z1684hemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm90_z1684hemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm90_z1684symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm90_z1684symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/share/info/cutlass/generated_kernels.txt -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/bin/cutlass_profiler -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/test/cutlass/ctest/ctest_profiler/CTestTestfile.ctest_profiler.cmake -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/test/cutlass/CTestTestfile.cmake -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/cmake/NvidiaCutlass/NvidiaCutlassConfig.cmake -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/cmake/NvidiaCutlass/NvidiaCutlassConfigVersion.cmake -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/cmake/NvidiaCutlass/NvidiaCutlassTargets.cmake -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/cmake/NvidiaCutlass/NvidiaCutlassTargets-release.cmake + popd ~/build/BUILD/cutlass + rm -rf /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/test + rm -rf /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/share/info + set +x Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/bin/cutlass_profiler Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm50_cf32_cdgrad_optimized_cf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm50_cf32_cfprop_optimized_cf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm50_cf32_cwgrad_optimized_cf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm50_sdgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm50_sfprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm50_swgrad_optimized.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm60_hfprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_f16_s884dgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_f16_s884fprop_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_f16_s884wgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_h884dgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_h884fprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_h884wgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_s884dgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_s884fprop_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm70_s884wgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_cf32_cdgrad_optimized_cf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_cf32_cfprop_optimized_cf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_cf32_cwgrad_optimized_cf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688dgrad_optimized_f16.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_few_channels_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_fixed_channels_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_f16_s1688wgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_h1688dgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_few_channels.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_fixed_channels.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_h1688wgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_i8816fprop_optimized_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_i8816fprop_optimized_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_i8832fprop_optimized_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_i8832fprop_optimized_u4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s1688dgrad_optimized_f16.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_few_channels_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_fixed_channels_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s1688wgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s4_i8832fprop_optimized_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_few_channels_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_fixed_channels_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_optimized_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_u4_i8832fprop_optimized_u4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_few_channels_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_fixed_channels_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_optimized_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816dgrad_optimized_bf16.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816fprop_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816wgrad_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_f16_s16816dgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_f16_s16816fprop_fixed_channels_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_f16_s16816fprop_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_f16_s16816wgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_h16816dgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_h16816fprop_fixed_channels.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_h16816fprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_h16816wgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_i16832fprop_optimized_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_i16832fprop_optimized_u8.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_i16864fprop_optimized_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_i16864fprop_optimized_u4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s16816dgrad_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s16816dgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_fixed_channels_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_fixed_channels_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s16816wgrad_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s16816wgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688bf16dgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688bf16fprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688bf16wgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688dgrad_optimized.so 
Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688dgrad_optimized_tf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688f16dgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688f16fprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688f16wgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688fprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688fprop_optimized_tf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688tf32dgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688tf32fprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688tf32wgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688wgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s1688wgrad_optimized_tf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s4_i16864fprop_optimized_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_few_channels_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_fixed_channels_s8.so 
Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_optimized_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_sdgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_sfprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_swgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688dgrad_optimized_tf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688fprop_optimized_tf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688wgrad_optimized_tf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_u4_i16864fprop_optimized_u4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_few_channels_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_fixed_channels_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_optimized_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e5m2.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_optimized_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_optimized_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_f16_s16816dgrad3d_analytic_f16.so 
Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_f16_s16816dgrad3d_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_f16_s16816fprop3d_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_f16_s16816wgrad3d_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_h16816dgrad3d_analytic.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_h16816dgrad3d_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_h16816fprop3d_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_h16816wgrad3d_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_analytic_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_analytic_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_s16816fprop3d_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_s16816fprop3d_optimized_f16.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_s16816wgrad3d_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm80_s16816wgrad3d_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm50_cgemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm50_dgemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm50_sgemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm60_hgemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm61_igemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm61_s8_igemm_s8.so 
Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_planar_complex_array_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_planar_complex_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_h884gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_h884gemm_planar_complex.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_h884gemm_planar_complex_array.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_s884gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_s884gemm_planar_complex_array_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm70_s884gemm_planar_complex_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_array_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_h1688gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_h1688gemm_planar_complex.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_h1688gemm_planar_complex_array.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_i88128xorgemm_b1.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_i8816gemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_i8816gemm_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_i8832gemm_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_i8832gemm_u4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_s1688gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_s1688gemm_planar_complex_array_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_s1688gemm_planar_complex_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_s4_i8832gemm_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_s8_i8816gemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_u4_i8832gemm_u4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm75_u8_i8816gemm_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_s8_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_u8_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_bf16_s16832spgemm_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_c1688gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_c1688tf32gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_cgemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_d884gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_dgemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16_s8.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_array_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_s8_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_u8_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_f16_s16832spgemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_gz884gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_h16816gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_h16816gemm_grouped.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_h16816gemm_planar_complex.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_h16816gemm_planar_complex_array.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_h16816gemm_s8_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_h16832spgemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_i168128spgemm_s4.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_i168256andgemm_b1.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_i168256xorgemm_b1.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_i16832gemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_i16832gemm_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_i16864gemm_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_i16864gemm_u4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_i16864spgemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_grouped_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_grouped_f16.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_array_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_array_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_s8_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_s8_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_u8_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816gemm_u8_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16816tf32spgemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16832spgemm_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s16832spgemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s1688bf16gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s1688f16gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s1688gemm.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s1688gemm_tf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s1688tf32gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s4_i168128spgemm_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s4_i16864gemm_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s8_i16832gemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_s8_i16864spgemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_sgemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_tf32_s1688gemm_tf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_u4_i16864gemm_u4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_u8_i16832gemm_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm80_z884gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3.so 
Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e4m3_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e5m2_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e4m3_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e5m2_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x16gemm_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_d1684gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x16gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_gz1684gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_h64x128x16gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_i64x128x32gemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_i64x128x32gemm_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s64x128x16gemm_bf16.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s64x128x16gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e4m3_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e5m2_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s64x128x8gemm_tf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s64x128x8tf32gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s8_i64x128x32gemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_s8_i64x128x32gemm_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_void_i64x128x32gemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_void_i64x128x32gemm_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x16gemm_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x16gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_gemm_sm90_z1684gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_c1688her2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_c1688syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_c1688tf32her2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_c1688tf32syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_d884syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_gz884her2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_gz884syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_s1688syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_s1688tf32syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_z884her2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm80_z884syr2k.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm90_d1684syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm90_gz1684her2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm90_gz1684syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm90_z1684her2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_2k_sm90_z1684syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_c1688herk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_c1688syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_c1688tf32herk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_c1688tf32syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_d884syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_gz884herk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_gz884syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_s1688syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_s1688tf32syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_z884herk.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm80_z884syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm90_d1684syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm90_gz1684herk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm90_gz1684syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm90_z1684herk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_rank_k_sm90_z1684syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_c1688hemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_c1688symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_c1688tf32hemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_c1688tf32symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_d884symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_gz884hemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_gz884symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_s1688symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_s1688tf32symm.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_z884hemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm80_z884symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm90_d1684symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm90_gz1684hemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm90_gz1684symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm90_z1684hemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_symm_sm90_z1684symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm80_c1688tf32trmm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm80_c1688trmm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm80_d884trmm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm80_gz884trmm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm80_s1688tf32trmm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm80_s1688trmm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm80_z884trmm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm90_d1684trmm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm90_gz1684trmm.so 
Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/lib64/libcutlass_trmm_sm90_z1684trmm.so
+ /usr/lib/rpm/check-buildroot
+ /usr/lib/rpm/redhat/brp-ldconfig
/sbin/ldconfig: Warning: ignoring configuration file that cannot be opened: /etc/ld.so.conf: No such file or directory
+ /usr/lib/rpm/brp-compress
+ /usr/lib/rpm/brp-strip /usr/bin/strip
+ /usr/lib/rpm/brp-strip-comment-note /usr/bin/strip /usr/bin/objdump
+ /usr/lib/rpm/brp-strip-static-archive /usr/bin/strip
+ /usr/lib/rpm/brp-python-bytecompile '' 1
+ /usr/lib/rpm/brp-python-hardlink
+ PYTHON3=/usr/bin/python3.6
+ /usr/lib/rpm/redhat/brp-mangle-shebangs
Processing files: cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64
Executing(%doc): /bin/sh -e /var/tmp/rpm-tmp.iSZRpZ
+ umask 022
+ cd /builddir/build/BUILD
+ cd cutlass
+ DOCDIR=/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/share/doc/cutlass
+ export LC_ALL=C
+ LC_ALL=C
+ export DOCDIR
+ /usr/bin/mkdir -p /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/share/doc/cutlass
+ cp -pr README.md /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/share/doc/cutlass
+ cp -pr docs /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/share/doc/cutlass
+ exit 0
Executing(%license): /bin/sh -e /var/tmp/rpm-tmp.0h2eQ6
+ umask 022
+ cd /builddir/build/BUILD
+ cd cutlass
+ LICENSEDIR=/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/share/licenses/cutlass
+ export LC_ALL=C
+ LC_ALL=C
+ export LICENSEDIR
+ /usr/bin/mkdir -p /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/share/licenses/cutlass
+ cp -pr LICENSE.txt /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64/usr/share/licenses/cutlass
+ exit 0
Provides: cutlass = 3.5.0-20240411.1.cu12_4.el8 cutlass(aarch-64) = 3.5.0-20240411.1.cu12_4.el8 libcutlass.so()(64bit) libcutlass_conv2d_sm50_cf32_cdgrad_optimized_cf32.so()(64bit)
libcutlass_conv2d_sm50_cf32_cfprop_optimized_cf32.so()(64bit) libcutlass_conv2d_sm50_cf32_cwgrad_optimized_cf32.so()(64bit) libcutlass_conv2d_sm50_sdgrad_optimized.so()(64bit) libcutlass_conv2d_sm50_sfprop_optimized.so()(64bit) libcutlass_conv2d_sm50_swgrad_optimized.so()(64bit) libcutlass_conv2d_sm60_hfprop_optimized.so()(64bit) libcutlass_conv2d_sm70_f16_s884dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_f16_s884fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_f16_s884wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_h884dgrad_optimized.so()(64bit) libcutlass_conv2d_sm70_h884fprop_optimized.so()(64bit) libcutlass_conv2d_sm70_h884wgrad_optimized.so()(64bit) libcutlass_conv2d_sm70_s884dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_s884fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_s884wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_cf32_cdgrad_optimized_cf32.so()(64bit) libcutlass_conv2d_sm75_cf32_cfprop_optimized_cf32.so()(64bit) libcutlass_conv2d_sm75_cf32_cwgrad_optimized_cf32.so()(64bit) libcutlass_conv2d_sm75_f16_s1688dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_f16_s1688fprop_few_channels_f16.so()(64bit) libcutlass_conv2d_sm75_f16_s1688fprop_fixed_channels_f16.so()(64bit) libcutlass_conv2d_sm75_f16_s1688fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_f16_s1688wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_h1688dgrad_optimized.so()(64bit) libcutlass_conv2d_sm75_h1688fprop_few_channels.so()(64bit) libcutlass_conv2d_sm75_h1688fprop_fixed_channels.so()(64bit) libcutlass_conv2d_sm75_h1688fprop_optimized.so()(64bit) libcutlass_conv2d_sm75_h1688wgrad_optimized.so()(64bit) libcutlass_conv2d_sm75_i8816fprop_optimized_s8.so()(64bit) libcutlass_conv2d_sm75_i8816fprop_optimized_u8.so()(64bit) libcutlass_conv2d_sm75_i8832fprop_optimized_s4.so()(64bit) libcutlass_conv2d_sm75_i8832fprop_optimized_u4.so()(64bit) libcutlass_conv2d_sm75_s1688dgrad_optimized_f16.so()(64bit) 
libcutlass_conv2d_sm75_s1688fprop_few_channels_f16.so()(64bit) libcutlass_conv2d_sm75_s1688fprop_fixed_channels_f16.so()(64bit) libcutlass_conv2d_sm75_s1688fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_s1688wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_s4_i8832fprop_optimized_s4.so()(64bit) libcutlass_conv2d_sm75_s8_i8816fprop_few_channels_s8.so()(64bit) libcutlass_conv2d_sm75_s8_i8816fprop_fixed_channels_s8.so()(64bit) libcutlass_conv2d_sm75_s8_i8816fprop_optimized_s8.so()(64bit) libcutlass_conv2d_sm75_u4_i8832fprop_optimized_u4.so()(64bit) libcutlass_conv2d_sm75_u8_i8816fprop_few_channels_u8.so()(64bit) libcutlass_conv2d_sm75_u8_i8816fprop_fixed_channels_u8.so()(64bit) libcutlass_conv2d_sm75_u8_i8816fprop_optimized_u8.so()(64bit) libcutlass_conv2d_sm80_bf16_s16816dgrad_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16.so()(64bit) libcutlass_conv2d_sm80_bf16_s16816fprop_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_bf16_s16816wgrad_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_f16_s16816dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_f16_s16816fprop_fixed_channels_f16.so()(64bit) libcutlass_conv2d_sm80_f16_s16816fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_f16_s16816wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_h16816dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_h16816fprop_fixed_channels.so()(64bit) libcutlass_conv2d_sm80_h16816fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_h16816wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_i16832fprop_optimized_s8.so()(64bit) libcutlass_conv2d_sm80_i16832fprop_optimized_u8.so()(64bit) libcutlass_conv2d_sm80_i16864fprop_optimized_s4.so()(64bit) libcutlass_conv2d_sm80_i16864fprop_optimized_u4.so()(64bit) libcutlass_conv2d_sm80_s16816dgrad_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_s16816dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_s16816fprop_fixed_channels_bf16.so()(64bit) 
libcutlass_conv2d_sm80_s16816fprop_fixed_channels_f16.so()(64bit) libcutlass_conv2d_sm80_s16816fprop_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_s16816fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_s16816wgrad_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_s16816wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_s1688bf16dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688bf16fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688bf16wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688dgrad_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_s1688f16dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688f16fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688f16wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688fprop_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_s1688tf32dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688tf32fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688tf32wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688wgrad_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_s4_i16864fprop_optimized_s4.so()(64bit) libcutlass_conv2d_sm80_s8_i16832fprop_few_channels_s8.so()(64bit) libcutlass_conv2d_sm80_s8_i16832fprop_fixed_channels_s8.so()(64bit) libcutlass_conv2d_sm80_s8_i16832fprop_optimized_s8.so()(64bit) libcutlass_conv2d_sm80_sdgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_sfprop_optimized.so()(64bit) libcutlass_conv2d_sm80_swgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_tf32_s1688dgrad_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_tf32_s1688fprop_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_tf32_s1688wgrad_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_u4_i16864fprop_optimized_u4.so()(64bit) libcutlass_conv2d_sm80_u8_i16832fprop_few_channels_u8.so()(64bit) 
libcutlass_conv2d_sm80_u8_i16832fprop_fixed_channels_u8.so()(64bit) libcutlass_conv2d_sm80_u8_i16832fprop_optimized_u8.so()(64bit) libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e4m3.so()(64bit) libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e5m2.so()(64bit) libcutlass_conv2d_sm89_s16832fprop_optimized_e4m3.so()(64bit) libcutlass_conv2d_sm89_s16832fprop_optimized_e5m2.so()(64bit) libcutlass_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16.so()(64bit) libcutlass_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16.so()(64bit) libcutlass_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32.so()(64bit) libcutlass_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32.so()(64bit) libcutlass_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32.so()(64bit) libcutlass_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32.so()(64bit) libcutlass_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16.so()(64bit) libcutlass_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_f16_s16816dgrad3d_analytic_f16.so()(64bit) libcutlass_conv3d_sm80_f16_s16816dgrad3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_f16_s16816fprop3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_f16_s16816wgrad3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_h16816dgrad3d_analytic.so()(64bit) libcutlass_conv3d_sm80_h16816dgrad3d_optimized.so()(64bit) libcutlass_conv3d_sm80_h16816fprop3d_optimized.so()(64bit) libcutlass_conv3d_sm80_h16816wgrad3d_optimized.so()(64bit) libcutlass_conv3d_sm80_s16816dgrad3d_analytic_bf16.so()(64bit) libcutlass_conv3d_sm80_s16816dgrad3d_analytic_f16.so()(64bit) libcutlass_conv3d_sm80_s16816dgrad3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_s16816dgrad3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_s16816fprop3d_optimized_bf16.so()(64bit) 
libcutlass_conv3d_sm80_s16816fprop3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_s16816wgrad3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_s16816wgrad3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16.so()(64bit) libcutlass_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16.so()(64bit) libcutlass_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32.so()(64bit) libcutlass_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32.so()(64bit) libcutlass_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32.so()(64bit) libcutlass_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32.so()(64bit) libcutlass_gemm_sm50_cgemm.so()(64bit) libcutlass_gemm_sm50_dgemm.so()(64bit) libcutlass_gemm_sm50_sgemm.so()(64bit) libcutlass_gemm_sm60_hgemm.so()(64bit) libcutlass_gemm_sm61_igemm_s8.so()(64bit) libcutlass_gemm_sm61_s8_igemm_s8.so()(64bit) libcutlass_gemm_sm70_f16_s884gemm_f16.so()(64bit) libcutlass_gemm_sm70_f16_s884gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm70_f16_s884gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm70_h884gemm.so()(64bit) libcutlass_gemm_sm70_h884gemm_planar_complex.so()(64bit) libcutlass_gemm_sm70_h884gemm_planar_complex_array.so()(64bit) libcutlass_gemm_sm70_s884gemm_f16.so()(64bit) libcutlass_gemm_sm70_s884gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm70_s884gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm75_f16_s1688gemm_f16.so()(64bit) libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm75_h1688gemm.so()(64bit) libcutlass_gemm_sm75_h1688gemm_planar_complex.so()(64bit) libcutlass_gemm_sm75_h1688gemm_planar_complex_array.so()(64bit) libcutlass_gemm_sm75_i88128xorgemm_b1.so()(64bit) libcutlass_gemm_sm75_i8816gemm_s8.so()(64bit) libcutlass_gemm_sm75_i8816gemm_u8.so()(64bit) libcutlass_gemm_sm75_i8832gemm_s4.so()(64bit) 
libcutlass_gemm_sm75_i8832gemm_u4.so()(64bit) libcutlass_gemm_sm75_s1688gemm_f16.so()(64bit) libcutlass_gemm_sm75_s1688gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm75_s1688gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm75_s4_i8832gemm_s4.so()(64bit) libcutlass_gemm_sm75_s8_i8816gemm_s8.so()(64bit) libcutlass_gemm_sm75_u4_i8832gemm_u4.so()(64bit) libcutlass_gemm_sm75_u8_i8816gemm_u8.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_bf16_s8.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_bf16_u8.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_s8_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_u8_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16832spgemm_bf16.so()(64bit) libcutlass_gemm_sm80_c1688gemm.so()(64bit) libcutlass_gemm_sm80_c1688tf32gemm.so()(64bit) libcutlass_gemm_sm80_cgemm.so()(64bit) libcutlass_gemm_sm80_d884gemm.so()(64bit) libcutlass_gemm_sm80_dgemm.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_f16_s8.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_f16_u8.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_s8_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_u8_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16832spgemm_f16.so()(64bit) libcutlass_gemm_sm80_gz884gemm.so()(64bit) libcutlass_gemm_sm80_h16816gemm.so()(64bit) libcutlass_gemm_sm80_h16816gemm_grouped.so()(64bit) libcutlass_gemm_sm80_h16816gemm_planar_complex.so()(64bit) libcutlass_gemm_sm80_h16816gemm_planar_complex_array.so()(64bit) libcutlass_gemm_sm80_h16816gemm_s8_f16.so()(64bit) libcutlass_gemm_sm80_h16832spgemm.so()(64bit) libcutlass_gemm_sm80_i168128spgemm_s4.so()(64bit) 
libcutlass_gemm_sm80_i168256andgemm_b1.so()(64bit) libcutlass_gemm_sm80_i168256xorgemm_b1.so()(64bit) libcutlass_gemm_sm80_i16832gemm_s8.so()(64bit) libcutlass_gemm_sm80_i16832gemm_u8.so()(64bit) libcutlass_gemm_sm80_i16864gemm_s4.so()(64bit) libcutlass_gemm_sm80_i16864gemm_u4.so()(64bit) libcutlass_gemm_sm80_i16864spgemm_s8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_bf16_s8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_bf16_u8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_f16_s8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_f16_u8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_grouped_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_grouped_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_planar_complex_array_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_planar_complex_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_s8_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_s8_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_u8_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_u8_f16.so()(64bit) libcutlass_gemm_sm80_s16816tf32spgemm.so()(64bit) libcutlass_gemm_sm80_s16832spgemm_bf16.so()(64bit) libcutlass_gemm_sm80_s16832spgemm_f16.so()(64bit) libcutlass_gemm_sm80_s1688bf16gemm.so()(64bit) libcutlass_gemm_sm80_s1688f16gemm.so()(64bit) libcutlass_gemm_sm80_s1688gemm.so()(64bit) libcutlass_gemm_sm80_s1688gemm_tf32.so()(64bit) libcutlass_gemm_sm80_s1688tf32gemm.so()(64bit) libcutlass_gemm_sm80_s4_i168128spgemm_s4.so()(64bit) libcutlass_gemm_sm80_s4_i16864gemm_s4.so()(64bit) libcutlass_gemm_sm80_s8_i16832gemm_s8.so()(64bit) libcutlass_gemm_sm80_s8_i16864spgemm_s8.so()(64bit) libcutlass_gemm_sm80_sgemm.so()(64bit) libcutlass_gemm_sm80_tf32_s1688gemm_tf32.so()(64bit) libcutlass_gemm_sm80_u4_i16864gemm_u4.so()(64bit) 
libcutlass_gemm_sm80_u8_i16832gemm_u8.so()(64bit) libcutlass_gemm_sm80_z884gemm.so()(64bit) libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3.so()(64bit) libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2.so()(64bit) libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm89_s16832gemm_e4m3.so()(64bit) libcutlass_gemm_sm89_s16832gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm89_s16832gemm_e5m2.so()(64bit) libcutlass_gemm_sm89_s16832gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3.so()(64bit) libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2.so()(64bit) libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm89_s16864spgemm_e4m3.so()(64bit) libcutlass_gemm_sm89_s16864spgemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm89_s16864spgemm_e5m2.so()(64bit) libcutlass_gemm_sm89_s16864spgemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x16gemm_bf16.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_d1684gemm.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x16gemm_f16.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_gz1684gemm.so()(64bit) libcutlass_gemm_sm90_h64x128x16gemm.so()(64bit) libcutlass_gemm_sm90_i64x128x32gemm_s8.so()(64bit) libcutlass_gemm_sm90_i64x128x32gemm_u8.so()(64bit) libcutlass_gemm_sm90_s64x128x16gemm_bf16.so()(64bit) libcutlass_gemm_sm90_s64x128x16gemm_f16.so()(64bit) libcutlass_gemm_sm90_s64x128x32gemm_e4m3.so()(64bit) 
libcutlass_gemm_sm90_s64x128x32gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm90_s64x128x32gemm_e5m2.so()(64bit) libcutlass_gemm_sm90_s64x128x32gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_s64x128x8gemm_tf32.so()(64bit) libcutlass_gemm_sm90_s64x128x8tf32gemm.so()(64bit) libcutlass_gemm_sm90_s8_i64x128x32gemm_s8.so()(64bit) libcutlass_gemm_sm90_s8_i64x128x32gemm_u8.so()(64bit) libcutlass_gemm_sm90_void_i64x128x32gemm_s8.so()(64bit) libcutlass_gemm_sm90_void_i64x128x32gemm_u8.so()(64bit) libcutlass_gemm_sm90_void_s64x128x16gemm_bf16.so()(64bit) libcutlass_gemm_sm90_void_s64x128x16gemm_f16.so()(64bit) libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3.so()(64bit) libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2.so()(64bit) libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_z1684gemm.so()(64bit) libcutlass_rank_2k_sm80_c1688her2k.so()(64bit) libcutlass_rank_2k_sm80_c1688syr2k.so()(64bit) libcutlass_rank_2k_sm80_c1688tf32her2k.so()(64bit) libcutlass_rank_2k_sm80_c1688tf32syr2k.so()(64bit) libcutlass_rank_2k_sm80_d884syr2k.so()(64bit) libcutlass_rank_2k_sm80_gz884her2k.so()(64bit) libcutlass_rank_2k_sm80_gz884syr2k.so()(64bit) libcutlass_rank_2k_sm80_s1688syr2k.so()(64bit) libcutlass_rank_2k_sm80_s1688tf32syr2k.so()(64bit) libcutlass_rank_2k_sm80_z884her2k.so()(64bit) libcutlass_rank_2k_sm80_z884syr2k.so()(64bit) libcutlass_rank_2k_sm90_d1684syr2k.so()(64bit) libcutlass_rank_2k_sm90_gz1684her2k.so()(64bit) libcutlass_rank_2k_sm90_gz1684syr2k.so()(64bit) libcutlass_rank_2k_sm90_z1684her2k.so()(64bit) libcutlass_rank_2k_sm90_z1684syr2k.so()(64bit) libcutlass_rank_k_sm80_c1688herk.so()(64bit) libcutlass_rank_k_sm80_c1688syrk.so()(64bit) libcutlass_rank_k_sm80_c1688tf32herk.so()(64bit) libcutlass_rank_k_sm80_c1688tf32syrk.so()(64bit) libcutlass_rank_k_sm80_d884syrk.so()(64bit) libcutlass_rank_k_sm80_gz884herk.so()(64bit) libcutlass_rank_k_sm80_gz884syrk.so()(64bit) 
libcutlass_rank_k_sm80_s1688syrk.so()(64bit) libcutlass_rank_k_sm80_s1688tf32syrk.so()(64bit) libcutlass_rank_k_sm80_z884herk.so()(64bit) libcutlass_rank_k_sm80_z884syrk.so()(64bit) libcutlass_rank_k_sm90_d1684syrk.so()(64bit) libcutlass_rank_k_sm90_gz1684herk.so()(64bit) libcutlass_rank_k_sm90_gz1684syrk.so()(64bit) libcutlass_rank_k_sm90_z1684herk.so()(64bit) libcutlass_rank_k_sm90_z1684syrk.so()(64bit) libcutlass_symm_sm80_c1688hemm.so()(64bit) libcutlass_symm_sm80_c1688symm.so()(64bit) libcutlass_symm_sm80_c1688tf32hemm.so()(64bit) libcutlass_symm_sm80_c1688tf32symm.so()(64bit) libcutlass_symm_sm80_d884symm.so()(64bit) libcutlass_symm_sm80_gz884hemm.so()(64bit) libcutlass_symm_sm80_gz884symm.so()(64bit) libcutlass_symm_sm80_s1688symm.so()(64bit) libcutlass_symm_sm80_s1688tf32symm.so()(64bit) libcutlass_symm_sm80_z884hemm.so()(64bit) libcutlass_symm_sm80_z884symm.so()(64bit) libcutlass_symm_sm90_d1684symm.so()(64bit) libcutlass_symm_sm90_gz1684hemm.so()(64bit) libcutlass_symm_sm90_gz1684symm.so()(64bit) libcutlass_symm_sm90_z1684hemm.so()(64bit) libcutlass_symm_sm90_z1684symm.so()(64bit) libcutlass_trmm_sm80_c1688tf32trmm.so()(64bit) libcutlass_trmm_sm80_c1688trmm.so()(64bit) libcutlass_trmm_sm80_d884trmm.so()(64bit) libcutlass_trmm_sm80_gz884trmm.so()(64bit) libcutlass_trmm_sm80_s1688tf32trmm.so()(64bit) libcutlass_trmm_sm80_s1688trmm.so()(64bit) libcutlass_trmm_sm80_z884trmm.so()(64bit) libcutlass_trmm_sm90_d1684trmm.so()(64bit) libcutlass_trmm_sm90_gz1684trmm.so()(64bit) libcutlass_trmm_sm90_z1684trmm.so()(64bit) Requires(rpmlib): rpmlib(CompressedFileNames) <= 3.0.4-1 rpmlib(FileDigests) <= 4.6.0-1 rpmlib(PayloadFilesHavePrefix) <= 4.0-1 Requires: libc.so.6()(64bit) libc.so.6(GLIBC_2.17)(64bit) libcuda.so.1()(64bit) libcudart.so.12()(64bit) libcudart.so.12(libcudart.so.12)(64bit) libcutlass.so()(64bit) libcutlass_conv2d_sm50_cf32_cdgrad_optimized_cf32.so()(64bit) libcutlass_conv2d_sm50_cf32_cfprop_optimized_cf32.so()(64bit) 
libcutlass_conv2d_sm50_cf32_cwgrad_optimized_cf32.so()(64bit) libcutlass_conv2d_sm50_sdgrad_optimized.so()(64bit) libcutlass_conv2d_sm50_sfprop_optimized.so()(64bit) libcutlass_conv2d_sm50_swgrad_optimized.so()(64bit) libcutlass_conv2d_sm60_hfprop_optimized.so()(64bit) libcutlass_conv2d_sm70_f16_s884dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_f16_s884fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_f16_s884wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_h884dgrad_optimized.so()(64bit) libcutlass_conv2d_sm70_h884fprop_optimized.so()(64bit) libcutlass_conv2d_sm70_h884wgrad_optimized.so()(64bit) libcutlass_conv2d_sm70_s884dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_s884fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_s884wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_cf32_cdgrad_optimized_cf32.so()(64bit) libcutlass_conv2d_sm75_cf32_cfprop_optimized_cf32.so()(64bit) libcutlass_conv2d_sm75_cf32_cwgrad_optimized_cf32.so()(64bit) libcutlass_conv2d_sm75_f16_s1688dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_f16_s1688fprop_few_channels_f16.so()(64bit) libcutlass_conv2d_sm75_f16_s1688fprop_fixed_channels_f16.so()(64bit) libcutlass_conv2d_sm75_f16_s1688fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_f16_s1688wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_h1688dgrad_optimized.so()(64bit) libcutlass_conv2d_sm75_h1688fprop_few_channels.so()(64bit) libcutlass_conv2d_sm75_h1688fprop_fixed_channels.so()(64bit) libcutlass_conv2d_sm75_h1688fprop_optimized.so()(64bit) libcutlass_conv2d_sm75_h1688wgrad_optimized.so()(64bit) libcutlass_conv2d_sm75_i8816fprop_optimized_s8.so()(64bit) libcutlass_conv2d_sm75_i8816fprop_optimized_u8.so()(64bit) libcutlass_conv2d_sm75_i8832fprop_optimized_s4.so()(64bit) libcutlass_conv2d_sm75_i8832fprop_optimized_u4.so()(64bit) libcutlass_conv2d_sm75_s1688dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_s1688fprop_few_channels_f16.so()(64bit) 
libcutlass_conv2d_sm75_s1688fprop_fixed_channels_f16.so()(64bit) libcutlass_conv2d_sm75_s1688fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_s1688wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_s4_i8832fprop_optimized_s4.so()(64bit) libcutlass_conv2d_sm75_s8_i8816fprop_few_channels_s8.so()(64bit) libcutlass_conv2d_sm75_s8_i8816fprop_fixed_channels_s8.so()(64bit) libcutlass_conv2d_sm75_s8_i8816fprop_optimized_s8.so()(64bit) libcutlass_conv2d_sm75_u4_i8832fprop_optimized_u4.so()(64bit) libcutlass_conv2d_sm75_u8_i8816fprop_few_channels_u8.so()(64bit) libcutlass_conv2d_sm75_u8_i8816fprop_fixed_channels_u8.so()(64bit) libcutlass_conv2d_sm75_u8_i8816fprop_optimized_u8.so()(64bit) libcutlass_conv2d_sm80_bf16_s16816dgrad_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16.so()(64bit) libcutlass_conv2d_sm80_bf16_s16816fprop_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_bf16_s16816wgrad_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_f16_s16816dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_f16_s16816fprop_fixed_channels_f16.so()(64bit) libcutlass_conv2d_sm80_f16_s16816fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_f16_s16816wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_h16816dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_h16816fprop_fixed_channels.so()(64bit) libcutlass_conv2d_sm80_h16816fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_h16816wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_i16832fprop_optimized_s8.so()(64bit) libcutlass_conv2d_sm80_i16832fprop_optimized_u8.so()(64bit) libcutlass_conv2d_sm80_i16864fprop_optimized_s4.so()(64bit) libcutlass_conv2d_sm80_i16864fprop_optimized_u4.so()(64bit) libcutlass_conv2d_sm80_s16816dgrad_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_s16816dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_s16816fprop_fixed_channels_bf16.so()(64bit) libcutlass_conv2d_sm80_s16816fprop_fixed_channels_f16.so()(64bit) 
libcutlass_conv2d_sm80_s16816fprop_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_s16816fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_s16816wgrad_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_s16816wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_s1688bf16dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688bf16fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688bf16wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688dgrad_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_s1688f16dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688f16fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688f16wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688fprop_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_s1688tf32dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688tf32fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688tf32wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688wgrad_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_s4_i16864fprop_optimized_s4.so()(64bit) libcutlass_conv2d_sm80_s8_i16832fprop_few_channels_s8.so()(64bit) libcutlass_conv2d_sm80_s8_i16832fprop_fixed_channels_s8.so()(64bit) libcutlass_conv2d_sm80_s8_i16832fprop_optimized_s8.so()(64bit) libcutlass_conv2d_sm80_sdgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_sfprop_optimized.so()(64bit) libcutlass_conv2d_sm80_swgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_tf32_s1688dgrad_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_tf32_s1688fprop_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_tf32_s1688wgrad_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_u4_i16864fprop_optimized_u4.so()(64bit) libcutlass_conv2d_sm80_u8_i16832fprop_few_channels_u8.so()(64bit) libcutlass_conv2d_sm80_u8_i16832fprop_fixed_channels_u8.so()(64bit) 
libcutlass_conv2d_sm80_u8_i16832fprop_optimized_u8.so()(64bit) libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e4m3.so()(64bit) libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e5m2.so()(64bit) libcutlass_conv2d_sm89_s16832fprop_optimized_e4m3.so()(64bit) libcutlass_conv2d_sm89_s16832fprop_optimized_e5m2.so()(64bit) libcutlass_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16.so()(64bit) libcutlass_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16.so()(64bit) libcutlass_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32.so()(64bit) libcutlass_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32.so()(64bit) libcutlass_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32.so()(64bit) libcutlass_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32.so()(64bit) libcutlass_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16.so()(64bit) libcutlass_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_f16_s16816dgrad3d_analytic_f16.so()(64bit) libcutlass_conv3d_sm80_f16_s16816dgrad3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_f16_s16816fprop3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_f16_s16816wgrad3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_h16816dgrad3d_analytic.so()(64bit) libcutlass_conv3d_sm80_h16816dgrad3d_optimized.so()(64bit) libcutlass_conv3d_sm80_h16816fprop3d_optimized.so()(64bit) libcutlass_conv3d_sm80_h16816wgrad3d_optimized.so()(64bit) libcutlass_conv3d_sm80_s16816dgrad3d_analytic_bf16.so()(64bit) libcutlass_conv3d_sm80_s16816dgrad3d_analytic_f16.so()(64bit) libcutlass_conv3d_sm80_s16816dgrad3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_s16816dgrad3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_s16816fprop3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_s16816fprop3d_optimized_f16.so()(64bit) 
libcutlass_conv3d_sm80_s16816wgrad3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_s16816wgrad3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16.so()(64bit) libcutlass_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16.so()(64bit) libcutlass_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32.so()(64bit) libcutlass_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32.so()(64bit) libcutlass_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32.so()(64bit) libcutlass_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32.so()(64bit) libcutlass_gemm_sm50_cgemm.so()(64bit) libcutlass_gemm_sm50_dgemm.so()(64bit) libcutlass_gemm_sm50_sgemm.so()(64bit) libcutlass_gemm_sm60_hgemm.so()(64bit) libcutlass_gemm_sm61_igemm_s8.so()(64bit) libcutlass_gemm_sm61_s8_igemm_s8.so()(64bit) libcutlass_gemm_sm70_f16_s884gemm_f16.so()(64bit) libcutlass_gemm_sm70_f16_s884gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm70_f16_s884gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm70_h884gemm.so()(64bit) libcutlass_gemm_sm70_h884gemm_planar_complex.so()(64bit) libcutlass_gemm_sm70_h884gemm_planar_complex_array.so()(64bit) libcutlass_gemm_sm70_s884gemm_f16.so()(64bit) libcutlass_gemm_sm70_s884gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm70_s884gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm75_f16_s1688gemm_f16.so()(64bit) libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm75_h1688gemm.so()(64bit) libcutlass_gemm_sm75_h1688gemm_planar_complex.so()(64bit) libcutlass_gemm_sm75_h1688gemm_planar_complex_array.so()(64bit) libcutlass_gemm_sm75_i88128xorgemm_b1.so()(64bit) libcutlass_gemm_sm75_i8816gemm_s8.so()(64bit) libcutlass_gemm_sm75_i8816gemm_u8.so()(64bit) libcutlass_gemm_sm75_i8832gemm_s4.so()(64bit) libcutlass_gemm_sm75_i8832gemm_u4.so()(64bit) 
libcutlass_gemm_sm75_s1688gemm_f16.so()(64bit) libcutlass_gemm_sm75_s1688gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm75_s1688gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm75_s4_i8832gemm_s4.so()(64bit) libcutlass_gemm_sm75_s8_i8816gemm_s8.so()(64bit) libcutlass_gemm_sm75_u4_i8832gemm_u4.so()(64bit) libcutlass_gemm_sm75_u8_i8816gemm_u8.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_bf16_s8.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_bf16_u8.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_s8_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_u8_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16832spgemm_bf16.so()(64bit) libcutlass_gemm_sm80_c1688gemm.so()(64bit) libcutlass_gemm_sm80_c1688tf32gemm.so()(64bit) libcutlass_gemm_sm80_cgemm.so()(64bit) libcutlass_gemm_sm80_d884gemm.so()(64bit) libcutlass_gemm_sm80_dgemm.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_f16_s8.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_f16_u8.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_s8_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_u8_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16832spgemm_f16.so()(64bit) libcutlass_gemm_sm80_gz884gemm.so()(64bit) libcutlass_gemm_sm80_h16816gemm.so()(64bit) libcutlass_gemm_sm80_h16816gemm_grouped.so()(64bit) libcutlass_gemm_sm80_h16816gemm_planar_complex.so()(64bit) libcutlass_gemm_sm80_h16816gemm_planar_complex_array.so()(64bit) libcutlass_gemm_sm80_h16816gemm_s8_f16.so()(64bit) libcutlass_gemm_sm80_h16832spgemm.so()(64bit) libcutlass_gemm_sm80_i168128spgemm_s4.so()(64bit) libcutlass_gemm_sm80_i168256andgemm_b1.so()(64bit) 
libcutlass_gemm_sm80_i168256xorgemm_b1.so()(64bit) libcutlass_gemm_sm80_i16832gemm_s8.so()(64bit) libcutlass_gemm_sm80_i16832gemm_u8.so()(64bit) libcutlass_gemm_sm80_i16864gemm_s4.so()(64bit) libcutlass_gemm_sm80_i16864gemm_u4.so()(64bit) libcutlass_gemm_sm80_i16864spgemm_s8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_bf16_s8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_bf16_u8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_f16_s8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_f16_u8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_grouped_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_grouped_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_planar_complex_array_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_planar_complex_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_s8_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_s8_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_u8_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_u8_f16.so()(64bit) libcutlass_gemm_sm80_s16816tf32spgemm.so()(64bit) libcutlass_gemm_sm80_s16832spgemm_bf16.so()(64bit) libcutlass_gemm_sm80_s16832spgemm_f16.so()(64bit) libcutlass_gemm_sm80_s1688bf16gemm.so()(64bit) libcutlass_gemm_sm80_s1688f16gemm.so()(64bit) libcutlass_gemm_sm80_s1688gemm.so()(64bit) libcutlass_gemm_sm80_s1688gemm_tf32.so()(64bit) libcutlass_gemm_sm80_s1688tf32gemm.so()(64bit) libcutlass_gemm_sm80_s4_i168128spgemm_s4.so()(64bit) libcutlass_gemm_sm80_s4_i16864gemm_s4.so()(64bit) libcutlass_gemm_sm80_s8_i16832gemm_s8.so()(64bit) libcutlass_gemm_sm80_s8_i16864spgemm_s8.so()(64bit) libcutlass_gemm_sm80_sgemm.so()(64bit) libcutlass_gemm_sm80_tf32_s1688gemm_tf32.so()(64bit) libcutlass_gemm_sm80_u4_i16864gemm_u4.so()(64bit) libcutlass_gemm_sm80_u8_i16832gemm_u8.so()(64bit) libcutlass_gemm_sm80_z884gemm.so()(64bit) 
libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3.so()(64bit) libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2.so()(64bit) libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm89_s16832gemm_e4m3.so()(64bit) libcutlass_gemm_sm89_s16832gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm89_s16832gemm_e5m2.so()(64bit) libcutlass_gemm_sm89_s16832gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3.so()(64bit) libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2.so()(64bit) libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm89_s16864spgemm_e4m3.so()(64bit) libcutlass_gemm_sm89_s16864spgemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm89_s16864spgemm_e5m2.so()(64bit) libcutlass_gemm_sm89_s16864spgemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x16gemm_bf16.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_d1684gemm.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x16gemm_f16.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_gz1684gemm.so()(64bit) libcutlass_gemm_sm90_h64x128x16gemm.so()(64bit) libcutlass_gemm_sm90_i64x128x32gemm_s8.so()(64bit) libcutlass_gemm_sm90_i64x128x32gemm_u8.so()(64bit) libcutlass_gemm_sm90_s64x128x16gemm_bf16.so()(64bit) libcutlass_gemm_sm90_s64x128x16gemm_f16.so()(64bit) libcutlass_gemm_sm90_s64x128x32gemm_e4m3.so()(64bit) libcutlass_gemm_sm90_s64x128x32gemm_e4m3_e5m2.so()(64bit) 
libcutlass_gemm_sm90_s64x128x32gemm_e5m2.so()(64bit) libcutlass_gemm_sm90_s64x128x32gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_s64x128x8gemm_tf32.so()(64bit) libcutlass_gemm_sm90_s64x128x8tf32gemm.so()(64bit) libcutlass_gemm_sm90_s8_i64x128x32gemm_s8.so()(64bit) libcutlass_gemm_sm90_s8_i64x128x32gemm_u8.so()(64bit) libcutlass_gemm_sm90_void_i64x128x32gemm_s8.so()(64bit) libcutlass_gemm_sm90_void_i64x128x32gemm_u8.so()(64bit) libcutlass_gemm_sm90_void_s64x128x16gemm_bf16.so()(64bit) libcutlass_gemm_sm90_void_s64x128x16gemm_f16.so()(64bit) libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3.so()(64bit) libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2.so()(64bit) libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_z1684gemm.so()(64bit) libcutlass_rank_2k_sm80_c1688her2k.so()(64bit) libcutlass_rank_2k_sm80_c1688syr2k.so()(64bit) libcutlass_rank_2k_sm80_c1688tf32her2k.so()(64bit) libcutlass_rank_2k_sm80_c1688tf32syr2k.so()(64bit) libcutlass_rank_2k_sm80_d884syr2k.so()(64bit) libcutlass_rank_2k_sm80_gz884her2k.so()(64bit) libcutlass_rank_2k_sm80_gz884syr2k.so()(64bit) libcutlass_rank_2k_sm80_s1688syr2k.so()(64bit) libcutlass_rank_2k_sm80_s1688tf32syr2k.so()(64bit) libcutlass_rank_2k_sm80_z884her2k.so()(64bit) libcutlass_rank_2k_sm80_z884syr2k.so()(64bit) libcutlass_rank_2k_sm90_d1684syr2k.so()(64bit) libcutlass_rank_2k_sm90_gz1684her2k.so()(64bit) libcutlass_rank_2k_sm90_gz1684syr2k.so()(64bit) libcutlass_rank_2k_sm90_z1684her2k.so()(64bit) libcutlass_rank_2k_sm90_z1684syr2k.so()(64bit) libcutlass_rank_k_sm80_c1688herk.so()(64bit) libcutlass_rank_k_sm80_c1688syrk.so()(64bit) libcutlass_rank_k_sm80_c1688tf32herk.so()(64bit) libcutlass_rank_k_sm80_c1688tf32syrk.so()(64bit) libcutlass_rank_k_sm80_d884syrk.so()(64bit) libcutlass_rank_k_sm80_gz884herk.so()(64bit) libcutlass_rank_k_sm80_gz884syrk.so()(64bit) libcutlass_rank_k_sm80_s1688syrk.so()(64bit) 
libcutlass_rank_k_sm80_s1688tf32syrk.so()(64bit) libcutlass_rank_k_sm80_z884herk.so()(64bit) libcutlass_rank_k_sm80_z884syrk.so()(64bit) libcutlass_rank_k_sm90_d1684syrk.so()(64bit) libcutlass_rank_k_sm90_gz1684herk.so()(64bit) libcutlass_rank_k_sm90_gz1684syrk.so()(64bit) libcutlass_rank_k_sm90_z1684herk.so()(64bit) libcutlass_rank_k_sm90_z1684syrk.so()(64bit) libcutlass_symm_sm80_c1688hemm.so()(64bit) libcutlass_symm_sm80_c1688symm.so()(64bit) libcutlass_symm_sm80_c1688tf32hemm.so()(64bit) libcutlass_symm_sm80_c1688tf32symm.so()(64bit) libcutlass_symm_sm80_d884symm.so()(64bit) libcutlass_symm_sm80_gz884hemm.so()(64bit) libcutlass_symm_sm80_gz884symm.so()(64bit) libcutlass_symm_sm80_s1688symm.so()(64bit) libcutlass_symm_sm80_s1688tf32symm.so()(64bit) libcutlass_symm_sm80_z884hemm.so()(64bit) libcutlass_symm_sm80_z884symm.so()(64bit) libcutlass_symm_sm90_d1684symm.so()(64bit) libcutlass_symm_sm90_gz1684hemm.so()(64bit) libcutlass_symm_sm90_gz1684symm.so()(64bit) libcutlass_symm_sm90_z1684hemm.so()(64bit) libcutlass_symm_sm90_z1684symm.so()(64bit) libcutlass_trmm_sm80_c1688tf32trmm.so()(64bit) libcutlass_trmm_sm80_c1688trmm.so()(64bit) libcutlass_trmm_sm80_d884trmm.so()(64bit) libcutlass_trmm_sm80_gz884trmm.so()(64bit) libcutlass_trmm_sm80_s1688tf32trmm.so()(64bit) libcutlass_trmm_sm80_s1688trmm.so()(64bit) libcutlass_trmm_sm80_z884trmm.so()(64bit) libcutlass_trmm_sm90_d1684trmm.so()(64bit) libcutlass_trmm_sm90_gz1684trmm.so()(64bit) libcutlass_trmm_sm90_z1684trmm.so()(64bit) libgcc_s.so.1()(64bit) libgcc_s.so.1(GCC_3.0)(64bit) libm.so.6()(64bit) libm.so.6(GLIBC_2.17)(64bit) libstdc++.so.6()(64bit) libstdc++.so.6(CXXABI_1.3)(64bit) libstdc++.so.6(CXXABI_1.3.5)(64bit) libstdc++.so.6(CXXABI_1.3.9)(64bit) libstdc++.so.6(GLIBCXX_3.4)(64bit) libstdc++.so.6(GLIBCXX_3.4.11)(64bit) libstdc++.so.6(GLIBCXX_3.4.15)(64bit) libstdc++.so.6(GLIBCXX_3.4.18)(64bit) libstdc++.so.6(GLIBCXX_3.4.20)(64bit) libstdc++.so.6(GLIBCXX_3.4.21)(64bit) libstdc++.so.6(GLIBCXX_3.4.5)(64bit) 
libstdc++.so.6(GLIBCXX_3.4.9)(64bit) rtld(GNU_HASH)
Processing files: cutlass-devel-3.5.0-20240411.1.cu12_4.el8.aarch64
Provides: cmake(NvidiaCutlass) = 3.5.0 cmake(nvidiacutlass) = 3.5.0 cutlass-devel = 3.5.0-20240411.1.cu12_4.el8 cutlass-devel(aarch-64) = 3.5.0-20240411.1.cu12_4.el8
Requires(rpmlib): rpmlib(CompressedFileNames) <= 3.0.4-1 rpmlib(FileDigests) <= 4.6.0-1 rpmlib(PayloadFilesHavePrefix) <= 4.0-1
Requires: cmake-filesystem(aarch-64)
Processing files: cutlass-static-3.5.0-20240411.1.cu12_4.el8.aarch64
Provides: cutlass-static = 3.5.0-20240411.1.cu12_4.el8 cutlass-static(aarch-64) = 3.5.0-20240411.1.cu12_4.el8
Requires(rpmlib): rpmlib(CompressedFileNames) <= 3.0.4-1 rpmlib(FileDigests) <= 4.6.0-1 rpmlib(PayloadFilesHavePrefix) <= 4.0-1
Checking for unpackaged file(s): /usr/lib/rpm/check-files /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64
Wrote: /builddir/build/RPMS/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64.rpm
Wrote: /builddir/build/RPMS/cutlass-devel-3.5.0-20240411.1.cu12_4.el8.aarch64.rpm
Wrote: /builddir/build/RPMS/cutlass-static-3.5.0-20240411.1.cu12_4.el8.aarch64.rpm
Executing(%clean): /bin/sh -e /var/tmp/rpm-tmp.TspFXE
+ umask 022
+ cd /builddir/build/BUILD
+ cd cutlass
+ /usr/bin/rm -rf /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.aarch64
+ exit 0
Finish: rpmbuild cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm
Finish: build phase for cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm
INFO: chroot_scan: 3 files copied to /var/lib/copr-rpmbuild/results/chroot_scan
INFO: /var/lib/mock/rhel+epel-8-aarch64-1713469169.948153/root/var/log/dnf.rpm.log
/var/lib/mock/rhel+epel-8-aarch64-1713469169.948153/root/var/log/dnf.librepo.log
/var/lib/mock/rhel+epel-8-aarch64-1713469169.948153/root/var/log/dnf.log
INFO: Done(/var/lib/copr-rpmbuild/results/cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm) Config(child) 334 minutes 31 seconds
INFO: Results and/or logs in: /var/lib/copr-rpmbuild/results
INFO: Cleaning up build root ('cleanup_on_success=True')
Start: clean chroot
INFO: unmounting tmpfs.
Finish: clean chroot
Finish: run
Running RPMResults tool
Package info:
{
    "packages": [
        {
            "name": "cutlass-devel",
            "epoch": null,
            "version": "3.5.0",
            "release": "20240411.1.cu12_4.el8",
            "arch": "aarch64"
        },
        {
            "name": "cutlass-static",
            "epoch": null,
            "version": "3.5.0",
            "release": "20240411.1.cu12_4.el8",
            "arch": "aarch64"
        },
        {
            "name": "cutlass",
            "epoch": null,
            "version": "3.5.0",
            "release": "20240411.1.cu12_4.el8",
            "arch": "aarch64"
        },
        {
            "name": "cutlass",
            "epoch": null,
            "version": "3.5.0",
            "release": "20240411.1.cu12_4.el8",
            "arch": "src"
        }
    ]
}
RPMResults finished